Foreword



Physical design of integrated circuits remains one of the most interesting and chal- 
lenging arenas in the field of Electronic Design Automation. The ability to integrate 
more and more devices on our silicon chips requires the algorithms to continuously 
scale up. Nowadays we can integrate 2e9 transistors on a single 45nm-technology 
chip. This number will continue to scale for the next couple of technology genera- 
tions, requiring more transistors to be automatically placed on a chip and connected 
together. In addition, more and more of the delay is contributed by the wires that 
interconnect the devices on the chip. This has a profound effect on how physical 
design flows need to be put together. In the 1990s, it was safe to assume that timing 
goals of the design could be reached once the devices were placed well on the chip. 
Today, one does not know whether the timing constraints can be satisfied until the 
final routing has completed. 

As far back as 10 or 15 years ago, people believed that most physical design prob- 
lems had been solved. But, the continued increase in the number of transistors on 
the chip, as well as the increased coupling between the physical, timing and logic 
domains warrant a fresh look at the basic algorithmic foundations of chip implementation.
That is exactly what this book provides. It covers the basic algorithms underlying
all physical design steps and also shows how they are applied to current instances
of the design problems. For example, Chapter 7 provides a great deal of
information on special types of routing for specific design situations. 

Several other books provide in-depth descriptions of core physical design algorithms 
and the underlying mathematics, but this book goes a step further. The authors very 
much realize that the era of individual point algorithms with single objectives is over. 
Throughout the book they emphasize the multi-objective nature of modern design 
problems and they bring all the pieces of a physical design flow together in Chapter 
8. A complete flow chart, from design partitioning and floorplanning all the way to 
electrical rule checking, describes all phases of the modern chip implementation
flow. Each step is described in the context of the overall flow with references to the 
preceding chapters for the details. 

This book will be appreciated by students and professionals alike. It starts from the 
basics and provides sufficient background material to get the reader up to speed on 
the real issues. Each of the chapters by itself provides sufficient introduction and 
depth to be very valuable. This is especially important in the present era, where 
experts in one area must understand the effects of their algorithms on the remainder 
of the design flow. An expert in routing will derive great benefit from reading the 
chapters on planning and placement. An expert in Design For Manufacturability 
(DFM) who seeks a better understanding of routing algorithms, and of how these 
algorithms can be affected by choices made in setting DFM requirements, will bene- 
fit tremendously from the chapters on global and detailed routing. 



The book is completed by a detailed set of solutions to the exercises that accompany 
each chapter. The exercises force the student to truly understand the basic physical 
design algorithms and apply them to small but insightful problem instances. 

This book will serve the EDA and design community well. It will be a foundational 
text and reference for the next generation of professionals who will be called on to 
continue the advancement of our chip design tools. 

Dr. Leon Stok 

Vice President, Electronic Design Automation 
IBM Systems and Technology Group 
Hopewell Junction, NY 

Preface



VLSI physical design of integrated circuits underwent explosive development in the 
1980s and 1990s. Many basic techniques were suggested by researchers and imple- 
mented in commercial tools, but only described in brief conference publications 
geared for experts in the field. In the 2000s, academic and industry researchers fo- 
cused on comparative evaluation of basic techniques, their extension to large-scale 
optimization, and the assembly of point optimizations into multi-objective design 
flows. Our book covers these aspects of physical design in a consistent way, starting 
with basic concepts in Chapter 1 and gradually increasing the depth to reach advanced
concepts, such as physical synthesis. Readers seeking additional details will
find a number of references discussed in each chapter, including specialized mono- 
graphs and recent conference publications. 

Chapter 2 covers netlist partitioning. It first discusses typical problem formulations 
and proceeds to classic algorithms for balanced graph and hypergraph partitioning. 
The last section covers an important application - system partitioning among multi- 
ple FPGAs, used in the context of high-speed emulation in functional validation. 

Chapter 3 is dedicated to chip planning, which includes floorplanning, power- 
ground planning and I/O assignment. A broad range of topics and techniques are 
covered, ranging from graph-theoretical aspects of block-packing to optimization by 
simulated annealing and package-aware I/O planning. 

Chapter 4 addresses VLSI placement and covers a number of practical problem 
formulations. It distinguishes between global and detailed placement, and first covers
several algorithmic frameworks traditionally used for global placement. Detailed
placement algorithms are covered in a separate section. The current state of the art
in placement is reviewed, with suggestions to readers who might want to implement
their own software tools for large-scale placement.

Chapters 5 and 6 discuss global and detailed routing, which have received signifi- 
cant attention in research literature due to their interaction with manufacturability 
and chip-yield optimizations. Topics covered include representing layout with graph 
models and performing routing, for single and multiple nets, in these models. State- 
of-the-art global routers are discussed, as well as yield optimizations performed in 
detailed routing to address specific types of manufacturing faults. 

Chapter 7 deals with several specialized types of routing which do not conform with 
the global-detailed paradigm followed by Chapters 5 and 6. These include non- 
Manhattan area routing, commonly used in PCBs, and clock-tree routing required 
for every synchronous digital circuit. In addition to algorithmic aspects, we explore 
the impact of process variability on clock-tree routing and means of decreasing this 
impact. 

Chapter 8 focuses on timing closure, and its perspective is unique. It
offers comprehensive coverage of timing analysis and relevant optimizations in
placement, routing and netlist restructuring. Section 8.6 assembles all these tech- 
niques, along with those covered in earlier chapters, into an extensive design flow, 
illustrated in detail with a flow chart and discussed step-by-step with several figures 
and many references. 

This book does not assume prior exposure to physical design or other areas of EDA. 
It introduces the reader to the EDA industry and basic EDA concepts, covers key 
graph concepts and algorithm analysis, carefully defines terms and specifies basic 
algorithms with pseudocode. Many illustrations are given throughout the book, and 
every chapter includes a set of exercises, solutions to which are given in one of the 
appendices. Unlike most other sources on physical design, we made an effort to 
avoid impractical and unnecessarily complicated algorithms. In many cases we offer 
comparisons between several leading algorithmic techniques and refer the reader to 
publications with additional empirical results. 

Some chapters are based on material in the book Layoutsynthese elektronischer 
Schaltungen - Grundlegende Algorithmen für die Entwurfsautomatisierung, which
was published by Springer in 2006. 

We are grateful to our colleagues and students who proofread earlier versions of this 
book and suggested a number of improvements (in alphabetical order): Matthew 
Guthaus, Kwangok Jeong, Johann Knechtel, Andreas Krinke, Nancy MacDonald, 
Jarrod Roy, Yen-Kuan Wu and Hailong Yao. 

Images for global placement and clock routing in Chapter 8 were provided by 
Myung-Chul Kim and Dong-Jin Lee. Cell libraries in Appendix B were provided by
Bob Bullock, Dan Clein and Bill Lye from PMC Sierra; the layout and schematics in 
Appendix B were generated by Matthias Thiele. The work on this book was partially 
supported by the National Science Foundation (NSF) through the CAREER 
award 0448189 as well as by Texas Instruments and Sun Microsystems. 

We hope that you will find the book interesting to read and useful in your profes- 
sional endeavors. 

Sincerely, 

Andrew, Jens, Igor and Jin 

1 Introduction



The design and optimization of integrated circuits (ICs) are essential to the production
of new semiconductor chips. Modern chip design has become so complex that it
is largely performed by specialized software, which is frequently updated to reflect 
improvements in semiconductor technologies and increasing design complexities. A 
user of this software needs a high-level understanding of the implemented algo- 
rithms. On the other hand, a developer of this software must have a strong computer- 
science background, including a keen understanding of how various algorithms 
operate and interact, and what their performance bottlenecks are. 

This book introduces and evaluates algorithms used during physical design to pro- 
duce a geometric chip layout from an abstract circuit design. Rather than list every 
relevant technique, however, it presents the essential and fundamental algorithms 
used within each physical design stage. 

— Partitioning (Chap. 2) and chip planning (Chap. 3) of design functionality 
during the initial stages of physical design 

— Geometric placement (Chap. 4) and routing (Chaps. 5-6) of circuit components

— Specialized routing and clock tree synthesis for synchronous circuits (Chap. 7) 

— Meeting specific technology and performance requirements, i.e., timing closure, 
such that the final fabricated layout satisfies system objectives (Chap. 8) 

Other design steps, such as circuit design, logic synthesis, transistor-level layout and 
verification, are not discussed in detail, but are covered in such references as [1.1]. 

This book emphasizes digital circuit design for very large-scale integration (VLSI);
the degree of automation for digital circuits is significantly higher than for analog
circuits. In particular, the focus is on algorithms for digital ICs, such as system partitioning
for field-programmable gate arrays (FPGAs) or clock network synthesis for
application-specific integrated circuits (ASICs). Similar design techniques can be
applied to other implementation contexts such as multi-chip modules (MCMs) and 
printed circuit boards (PCBs). 

The following broad questions, of interest to both students and designers, are ad- 
dressed in the upcoming chapters. 

— How is functionally correct layout produced from a netlist? 

— How does software for VLSI physical design work? 

— How do we develop and improve software for VLSI physical design? 

More information about this book is at http://vlsicad.eecs.umich.edu/KLMH/. 



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6_1, © Springer Science+Business Media B.V. 2011

1.1 Electronic Design Automation (EDA)



The Electronic Design Automation (EDA) industry develops software to support 
engineers in the creation of new integrated-circuit (IC) designs. Due to the high 
complexity of modern designs, EDA touches almost every aspect of the IC design
flow, from high-level system design to fabrication. EDA addresses designers' needs 
at multiple levels of electronic system hierarchy, including integrated circuits (ICs), 
multi-chip modules (MCMs), and printed circuit boards (PCBs). 

Progress in semiconductor technology, based on Moore's Law (Fig. 1.1), has led to 
integrated circuits (1) comprised of hundreds of millions of transistors, (2) assem- 
bled into packages, each having multiple chips and thousands of pins, and (3) 
mounted onto high-density interconnect (HDI) circuit boards with dozens of wiring 
layers. This design process is highly complex and heavily depends on automated 
tools. That is, computer software is used to mostly automate design steps such as 
logic design, simulation, physical design, and verification. 

EDA was first used in the 1960s in the form of simple programs to automate placement
of a very small number of blocks on a circuit board. Over the next few years,
the advent of the integrated circuit created a need for software that could reduce the 
total number of gates. Current software tools must additionally consider electrical 
effects such as signal delays and capacitive coupling between adjacent wires. In the 
modern VLSI design flow, nearly all steps use software to automate optimizations.

In the 1970s, semiconductor companies developed in-house EDA software, special- 
ized programs to address their proprietary design styles. In the 1980s and 1990s, 
independent software vendors created new tools for more widespread use. This gave 
rise to an independent EDA industry, which now enjoys annual revenues of ap- 
proximately five billion dollars and employs around twenty thousand people. Many 
EDA companies have headquarters in Santa Clara county, in the state of California. 
This area has been aptly dubbed the Silicon Valley. 

Several annual conferences showcase the progress of the EDA industry and acade- 
mia. The most notable one is the Design Automation Conference (DAC), which 
holds an industry trade show as well as an academic symposium. The International 
Conference on Computer-Aided Design (ICCAD) places emphasis on academic 
research, with papers that relate to specialized algorithm development. PCB devel- 
opers attend PCB Design Conference West in September. Overseas, Europe and 
Asia host the Design, Automation and Test in Europe (DATE) conference and the
Asia and South Pacific Design Automation Conference (ASP-DAC), respectively.
The world-wide engineering association Institute of Electrical and Electronics Engineers
(IEEE) publishes the monthly IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems (TCAD), while the Association for Computing
Machinery (ACM) publishes ACM Transactions on Design Automation of Electronic
Systems (TODAES).

Impact of EDA. According to Moore's Law (Fig. 1.1), the number of transistors on
a chip is increasing at an exponential rate. Historically, this corresponds to an annual 
compounded increase of 58% in the number of transistors per chip. 







Fig. 1.1 Moore's Law and the original graph from Gordon
Moore's article [1.10] predicting the increase in number of 
transistors. In 1965, Gordon Moore (Fairchild) stated that 
the number of transistors on an IC would double every 
year. Ten years later, he revised his statement, asserting 
that doubling would occur every 18 months. Since then, 
this "rule" has been famously known as Moore's Law. 



However, chip designs produced by prominent semiconductor companies suggest a 
different trend. The productivity of designers and (fixed-size) design teams,
measured by the number of transistors, has a compounded growth of only
around 21% per year, leading to a design productivity gap [1.5]. Since the number of
transistors is highly context-specific - analog versus digital or memory versus logic 
- this statistic, due to SEMATECH in the mid-1990s, refers to the design productiv- 
ity for standardized transistors. 
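
The effect of compounding these two rates can be made concrete with a short calculation. The following is a minimal sketch, assuming that the 58% and 21% annual rates quoted above stay constant over time; the function names are illustrative and do not come from the text.

```python
def annual_rate_from_doubling(months_to_double):
    """Annual compounded growth rate implied by doubling every `months_to_double` months."""
    return 2.0 ** (12.0 / months_to_double) - 1.0


def growth_factor(annual_rate, years):
    """Total growth factor after `years` years at a constant annual rate."""
    return (1.0 + annual_rate) ** years


def productivity_gap(years, capacity_rate=0.58, productivity_rate=0.21):
    """Ratio of transistor-count growth to design-productivity growth after `years` years."""
    return growth_factor(capacity_rate, years) / growth_factor(productivity_rate, years)


# Doubling every 18 months (Fig. 1.1) corresponds to the roughly 58% annual rate above.
print(f"18-month doubling -> {annual_rate_from_doubling(18):.1%} per year")
for years in (1, 5, 10):
    print(f"after {years:2d} years the gap is about {productivity_gap(years):.1f}x")
```

Under these assumptions, the number of transistors per chip outgrows design productivity by a factor of about 14 within ten years, which is why a fixed-size team cannot keep pace without better tools.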

Fig. 1.2, reproduced from the International Technology Roadmap for Semiconductors
(ITRS) [1.5], demonstrates that cost-feasible IC products require innovation in
EDA technology. Given the availability of efficient design technologies to semiconductor
design teams, the hardware design cost of a typical portable system-on-chip,
e.g., a baseband processor chip for a cell phone, remains manageable at $15.7 million
(2009 estimate). With associated software design costs, the overall chip design
project cost is $45.3 million. Without the design technology innovations between
1993 and 2007, and the resulting design productivity improvements, the design cost
of a chip would have been $1,800 million, well over a billion dollars.

ITRS 2009 Cost Chart 
(in Millions of Dollars) 
Fig. 1.2 Recent editions of the semiconductor technology roadmap project total hardware (HW)
engineering costs + EDA tool costs (dark gray) and total software (SW) engineering costs + elec- 
tronic system design automation (ESDA) tool costs (light gray). This shows the impact of EDA 
technologies on overall IC design productivity and hence IC design cost [1.5]. 

History of EDA. After tools for schematic entry of integrated circuits were developed,
the first EDA tool, a placer that optimized the physical locations of devices on
a circuit board, was created in the late 1960s. Shortly thereafter, programs were
written to aid circuit layout and visualization. The first integrated-circuit computer-aided
design (CAD) systems addressed the physical design process and were written
in the 1970s. During that era, most CAD tools were proprietary - major companies
such as IBM and AT&T Bell Laboratories relied on software tools designed for
internal use only. However, beginning in the 1980s, independent software developers
started to write tools that could serve the needs of multiple semiconductor product
companies. The electronic design automation (EDA) market grew rapidly in the
1990s, as many design teams adopted commercial tools instead of developing their
own in-house versions. The largest EDA software vendors today are, in alphabetical
order, Cadence Design Systems, Mentor Graphics, and Synopsys.

EDA tools have always been geared toward automating the entire design process 
and linking the design steps into a complete design flow. However, such integration 
is challenging, since some design steps need additional degrees of freedom, and 
scalability requires tackling some steps independently. On the other hand, the con- 
tinued decrease of transistor and wire dimensions has blurred the boundaries and 
abstractions that separate successive design steps - physical effects such as signal 
delays and coupling capacitances need to be accurately accounted for earlier in the 
design cycle. Thus, the design process is moving from a sequence of atomic (independent)
steps toward a deeper level of integration. Tab. 1.1 summarizes a timeline
of key developments in circuit and physical design.

Tab. 1.1 Timeline of EDA progress with respect to circuit and physical design.

Time Period    Circuit and Physical Design Process Advancements

1950-1965      Manual design only.

1965-1975      Layout editors, e.g., place and route tools, first developed for
               printed circuit boards.

1975-1985      More advanced tools for ICs and PCBs, with more sophisticated
               algorithms.

1985-1990      First performance-driven tools and parallel optimization algorithms
               for layout; better understanding of underlying theory
               (graph theory, solution complexity, etc.).

1990-2000      First over-the-cell routing, first 3D and multilayer placement
               and routing techniques developed. Automated circuit synthesis
               and routability-oriented design become dominant. Start of
               parallelizing workloads. Emergence of physical synthesis.

2000-now       Design for Manufacturability (DFM), optical proximity correction
               (OPC), and other techniques emerge at the design-manufacturing
               interface. Increased reusability of blocks, including
               intellectual property (IP) blocks.

1.2 VLSI Design Flow



The process of designing a very large-scale integrated (VLSI) circuit is highly com- 
plex. It can be separated into distinct steps (Fig. 1.3). Earlier steps are high-level; 
later design steps are at lower levels of abstraction. At the end of the process, before 
fabrication, tools and algorithms operate on detailed information about each circuit 
element's geometric shape and electrical properties. 

Fig. 1.3 The major steps in the VLSI circuit design flow, with a focus on the physical design steps:
partitioning (Chap. 2), chip planning (Chap. 3), placement (Chap. 4), clock tree synthesis (Chap. 7), 
routing (Chaps. 5-6), and timing closure (Chap. 8). 

The steps of the VLSI design flow in Fig. 1.3 are discussed in detail below. For 
further reading on physical design algorithms, see [1.1]. Books [1.6], [1.9], [1.11] 
and [1.12] cover other specialized topics that are not addressed here. 

System specification. Chip architects, circuit designers, product marketers, opera- 
tions managers, and layout and library designers collectively define the overall goals 
and high-level requirements of the system. These goals and requirements span func- 
tionality, performance, physical dimensions and production technology. 



Architectural design. A basic architecture must be determined to meet the system 
specifications. Example decisions are 

— Integration of analog and mixed-signal blocks

— Memory management - serial or parallel - and the addressing scheme 

— Number and types of computational cores, such as processors and digital signal 
processing (DSP) units - and particular DSP algorithms 

— Internal and external communication, support for standard protocols, etc. 

— Usage of hard and soft intellectual-property (IP) blocks 

— Pinout, packaging, and the die-package interface 

— Power requirements 

— Choice of process technology and layer stacks 

Functional and logic design. Once the architecture is set, the functionality and 
connectivity of each module (such as a processor core) must be defined. During 
functional design, only the high-level behavior must be determined. That is, each 
module has a set of inputs, outputs, and timing behavior. 

Logic design is performed at the register-transfer level (RTL) using a hardware 
description language (HDL) by means of programs that define the functional and 
timing behavior of a chip. Two common HDLs are Verilog and VHDL. HDL mod- 
ules must be thoroughly simulated and verified. 

Logic synthesis tools automate the process of converting HDL into low-level circuit 
elements. That is, given a Verilog or VHDL description and a technology library, a 
logic synthesis tool can map the described functionality to a list of signal nets, or 
netlist, and specific circuit elements such as standard cells and transistors. 
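
To make the notion of a netlist more tangible, the following is a minimal sketch of one possible in-memory representation. The cell types, instance names, pin names and the helper `nets_from_cells` are hypothetical and chosen only for illustration; they do not correspond to any particular tool or library format.

```python
from collections import defaultdict

# Hypothetical gate-level description: instance -> (cell type, {pin: net}).
# It implements y = NOT(a NAND b) with two standard cells.
cells = {
    "u1": ("NAND2", {"A": "a", "B": "b", "Y": "n1"}),
    "u2": ("INV", {"A": "n1", "Y": "y"}),
}


def nets_from_cells(cells):
    """Invert the cell-centric view into a net-centric one: net -> sorted list of (instance, pin)."""
    nets = defaultdict(list)
    for instance, (_cell_type, pins) in cells.items():
        for pin, net in pins.items():
            nets[net].append((instance, pin))
    return {net: sorted(connections) for net, connections in nets.items()}


netlist = nets_from_cells(cells)
for net, connections in sorted(netlist.items()):
    print(net, connections)
```

Both views carry the same information: the cell-centric view lists the nets attached to each circuit element, and the net-centric view lists the pins connected by each signal net.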

Circuit design. For the bulk of digital logic on the chip, the logic synthesis tool 
automatically converts Boolean expressions into what is referred to as a gate-level 
netlist, at the granularity of standard cells or higher. However, a number of critical, 
low-level elements must be designed at the transistor level; this is referred to as 
circuit design. Example elements that are designed at the circuit level include static 
RAM blocks, I/O, analog circuits, high-speed functions (multipliers), and electrostatic
discharge (ESD) protection circuits. The correctness of circuit-level design is
predominantly verified by circuit simulation tools such as SPICE. 

Physical design. During physical design, all design components are instantiated 
with their geometric representations. In other words, all macros, cells, gates, transis- 
tors, etc., with fixed shapes and sizes per fabrication layer are assigned spatial loca- 
tions (placement) and have appropriate routing connections (routing) completed in 
metal layers. The result of physical design is a set of manufacturing specifications 
that must subsequently be verified. 

Physical design is performed with respect to design rules that represent the physical 
limitations of the fabrication medium. For instance, all wires must be a prescribed 
minimum distance apart and have prescribed minimum width. As such, the design 
layout must be recreated in (migrated to) each new manufacturing technology.
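
The two example rules, minimum width and minimum spacing, can be checked on a small set of shapes as follows. This is a minimal sketch under stated assumptions: wires are axis-aligned rectangles on a single layer, shapes that touch or overlap are treated as electrically connected and therefore exempt from the spacing rule, and the names `Rect`, `spacing` and `check_rules` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rect:
    """Axis-aligned wire shape: lower-left corner (x1, y1), upper-right corner (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def width(self):
        """Wire width, taken as the smaller of the two rectangle dimensions."""
        return min(self.x2 - self.x1, self.y2 - self.y1)


def spacing(a, b):
    """Euclidean distance between two rectangles (0 if they touch or overlap)."""
    dx = max(a.x1 - b.x2, b.x1 - a.x2, 0.0)
    dy = max(a.y1 - b.y2, b.y1 - a.y2, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def check_rules(shapes, min_width, min_spacing):
    """Return a list of ("width", i) and ("spacing", i, j) violations."""
    violations = []
    for i, shape in enumerate(shapes):
        if shape.width() < min_width:
            violations.append(("width", i))
    for (i, a), (j, b) in combinations(enumerate(shapes), 2):
        distance = spacing(a, b)
        # Assumption: touching or overlapping shapes belong to the same net.
        if 0.0 < distance < min_spacing:
            violations.append(("spacing", i, j))
    return violations


wires = [Rect(0, 0, 10, 1), Rect(0, 1.5, 10, 2.5), Rect(0, 5, 10, 5.5)]
print(check_rules(wires, min_width=1.0, min_spacing=1.0))
```

In this example the third wire is too narrow and the first two wires are too close together. Because the rule values are parameters, the same layout can pass in one technology and fail in another, which illustrates why a layout has to be migrated.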

Physical design directly impacts circuit performance, area, reliability, power, and
manufacturing yield. Examples of these impacts are discussed below. 

— Performance: long routes have significantly longer signal delays. 

— Area: placing connected modules far apart results in larger and slower chips. 

— Reliability: a large number of vias can significantly reduce the reliability of the
circuit.

— Power: transistors with smaller gate lengths achieve greater switching speeds at 
the cost of higher leakage current and manufacturing variability; larger transis- 
tors and longer wires result in greater dynamic power dissipation. 

— Yield: wires routed too close together may decrease yield due to electrical 
shorts occurring during manufacturing, but spreading gates too far apart may 
also undermine yield due to longer wires and a higher probability of opens. 

Due to its high complexity, physical design is split into several key steps (Fig. 1.3). 

— Partitioning (Chap. 2) breaks up a circuit into smaller subcircuits or modules, 
which can each be designed or analyzed individually. 

— Floorplanning (Chap. 3) determines the shapes and arrangement of subcircuits 
or modules, as well as the locations of external ports and IP or macro blocks. 

— Power and ground routing (Chap. 3), often intrinsic to floorplanning, distrib- 
utes power (VDD) and ground (GND) nets throughout the chip. 

— Placement (Chap. 4) finds the spatial locations of all cells within each block. 

— Clock network synthesis (Chap. 7) determines the buffering, gating (e.g., for 
power management) and routing of the clock signal to meet prescribed skew 
and delay requirements. 

— Global routing (Chap. 5) allocates routing resources that are used for connec- 
tions; example resources include routing tracks in channels and in switchboxes. 

— Detailed routing (Chap. 6) assigns routes to specific metal layers and routing 
tracks within the global routing resources. 

— Timing closure (Chap. 8) optimizes circuit performance by specialized place- 
ment and routing techniques. 

After detailed routing, electrically accurate layout optimization is performed at a
small scale. Parasitic resistances (R), capacitances (C) and inductances (L) are extracted
from the completed layout, and then passed to timing analysis tools to check
the functional behavior of the chip. If the analyses reveal erroneous behavior or an
insufficient design margin (guardband) against possible manufacturing and environmental
variations, then incremental design optimizations are performed.

The physical design of analog circuits deviates from the above methodology, which 
is geared primarily toward digital circuits. For analog physical design, the geometric 
representation of a circuit element is created using layout generators or manual 
drawing. These generators only use circuit elements with known electrical parameters,
such as the resistance of a resistor, and accordingly generate the appropriate
geometric representation, e.g., a resistor layout with specified length and width. 

Physical verification. After physical design is completed, the layout must be fully 
verified to ensure correct electrical and logical functionality. Some problems found
during physical verification can be tolerated if their impact on chip yield is negligi- 
ble. In other cases, the layout must be changed, but these changes must be minimal 
and should not introduce new problems. Therefore, at this stage, layout changes are 
usually performed manually by experienced design engineers. 

— Design rule checking (DRC) verifies that the layout meets all technology- 
imposed constraints. DRC also verifies layer density for chemical-mechanical 
polishing (CMP).

— Layout vs. schematic (LVS) checking verifies the functionality of the design.
From the layout, a netlist is derived and compared with the original netlist pro- 
duced from logic synthesis or circuit design. 

— Parasitic extraction derives electrical parameters of the layout elements from 
their geometric representations; with the netlist, these are used to verify the 
electrical characteristics of the circuit. 

— Antenna rule checking seeks to prevent antenna effects, which may damage
transistor gates during manufacturing plasma-etch steps through accumulation 
of excess charge on metal wires that are not connected to PN-junction nodes. 

— Electrical rule checking (ERC) verifies the correctness of power and ground 
connections, and that signal transition times (slew), capacitive loads and fanouts
are appropriately bounded.
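
The comparison at the heart of LVS checking can be illustrated with a deliberately simplified sketch. It assumes that instance and pin names are identical in the layout-derived and the original netlist, so that only net names may differ; production tools cannot rely on this and must match the two circuits structurally. The functions `connectivity` and `lvs_match` and all data below are hypothetical.

```python
def connectivity(netlist):
    """Net-name-independent signature: the set of nets, each a frozenset of (instance, pin)."""
    return {frozenset(connections) for connections in netlist.values()}


def lvs_match(layout_netlist, schematic_netlist):
    """True if both netlists connect exactly the same groups of pins."""
    return connectivity(layout_netlist) == connectivity(schematic_netlist)


# Original netlist from logic synthesis: net -> list of (instance, pin).
schematic = {
    "n1": [("u1", "Y"), ("u2", "A")],
    "y": [("u2", "Y")],
}
# Netlist derived from the layout; extracted nets carry different names.
layout_ok = {
    "net_17": [("u2", "A"), ("u1", "Y")],
    "net_18": [("u2", "Y")],
}
# Faulty layout: the connection between u1 and u2 is open.
layout_open = {
    "net_17": [("u1", "Y")],
    "net_19": [("u2", "A")],
    "net_18": [("u2", "Y")],
}

print(lvs_match(layout_ok, schematic))
print(lvs_match(layout_open, schematic))
```

The first comparison succeeds because net names and pin ordering do not affect the signature; the second fails because the open splits one net into two.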

Both analysis and synthesis techniques are integral to the design of VLSI circuits. 
Analysis typically entails the modeling of circuit parameters and signal transitions, 
and often involves the solution of various equations using established numerical 
methods. The choice of algorithms for these tasks is relatively straightforward, compared
to the vast possibilities for synthesis and optimization. Therefore, this book
focuses on optimization algorithms used in IC physical design, and does not cover
computational techniques used during physical verification and signoff.

Fabrication. The final DRC-/LVS-/ERC-clean layout, usually represented in the 
GDSII Stream format, is sent for manufacturing at a dedicated silicon foundry (fab).
The handoff of the design to the manufacturing process is called tapeout, even 
though data transmission from the design team to the silicon fab no longer relies on 
magnetic tape [1.6]. Generation of the data for manufacturing is sometimes referred 
to as streaming out, reflecting the use of GDSII Stream. 

At the fab, the design is patterned onto different layers using photolithographic proc- 
esses. Photomasks are used so that only certain patterns of silicon, specified by the 
layout, are exposed to a laser light source. Producing an IC requires many masks; 
modifying the design requires changes to some or all of the masks. 

ICs are manufactured on round silicon wafers with diameters ranging from 200 mm
(8 inches) to 300 mm (12 inches). The ICs must then be tested and labeled as either 
functional or defective, sometimes according to bins depending on the functional or 
parametric (speed, power) tests that have failed. At the end of the manufacturing 
process, the ICs are separated, or diced, by sawing the wafer into smaller pieces. 

Packaging and testing. After dicing, functional chips are typically packaged. Pack- 
aging is configured early in the design process, and reflects the application along 
with cost and form factor requirements. Package types include dual in-line packages 
(DIPs), pin grid arrays (PGAs), and ball grid arrays (BGAs). After a die is positioned
in the package cavity, its pins are connected to the package's pins, e.g., with
wire bonding or solder bumps (flip-chip). The package is then sealed. 

Manufacturing, assembly and testing can be sequenced in different ways. For exam- 
ple, in the increasingly important wafer-level chip-scale packaging (WLCSP) meth- 
odology, "bumping" with high-density solder bumps that facilitate delivery of power, 
ground and signals from the package to the die is performed before the wafer is 
diced. With multi-chip module based integration, chips are usually not packaged 
individually; rather, they are integrated as bare dice into the MCM, which is pack- 
aged separately at a later point. After packaging, the finished product may be tested 
to ensure that it meets design requirements such as function (input/output relations), 
timing or power dissipation. 

Selecting an appropriate circuit-design style is very important because this choice
affects time-to-market and design cost. VLSI design styles fall in two categories -
full-custom and semi-custom. Full-custom design is primarily seen with extremely
high-volume parts such as microprocessors or FPGAs, where the high cost of design
effort is amortized over large production volumes. Semi-custom design is used more
frequently because it reduces the complexity of the design process, and hence
time-to-market and overall cost as well.

The following semi-custom standard design styles are the most commonly used. 

— Cell-based: typically using standard cells and macro cells, the design has many 
pre-designed elements such as logic gates that are copied from libraries. 

— Array-based: typically either gate arrays or FPGAs, the design has a portion of 
pre-fabricated elements connected by pre-routed wires. 

Full-custom design. Among available design styles, a full-custom design style has 
the fewest constraints during layout generation, e.g., blocks can be placed anywhere 
on the chip without restriction. This approach usually results in a very compact chip 
with highly optimized electrical properties. However, such effort is laborious,
time-consuming, and can be error-prone due to a relative lack of automation.

Full-custom design is primarily useful for microprocessors and FPGAs, where the 
high cost of design effort is amortized over large production volumes, as well as for 
analog circuits, where extreme care must be taken to achieve matched layout and 
adherence to stringent electrical performance specifications. 

An essential tool for full-custom design is an efficient layout editor that does more 
than just draw polygons (Fig. 1.4). Many improved layout editors have integrated 
DRC checkers that continuously verify the current layout. If all design-rule violations
are fixed as they occur, the final layout will be DRC-clean by construction.

Fig. 1.4 An example of a high-functionality layout editor [From L-Edit, Tanner Research, Inc.].



Standard-cell designs. A digital standard cell is a predefined block that has fixed 
size and functionality. For instance, an AND cell with two inputs contains a two- 
input NAND gate followed by an inverter (Fig. 1.5). Standard cells are distributed in 
cell libraries, which are often provided at no cost by foundries and are pre-qualified
for manufacturing. 



[Figure: symbols and truth tables of the AND, OR, INV, NAND and NOR cells, with inputs IN1 and IN2 (IN for INV) and output OUT.]

Fig. 1.5 Examples of common digital cells with their input and output behavior.



Standard cells are designed in multiples of a fixed cell height, with fixed locations of
power (VDD) and ground (GND) ports. Cell widths vary depending on the transistor
network implemented. As a consequence of this restricted layout style, all cells are
placeable in rows such that power and ground supply nets are distributed by (horizontal)
abutment (Fig. 1.6). The cells' signal ports may be located at the "upper" and
"lower" cell boundaries or distributed throughout the cell area [1.3].

Fig. 1.6 Implementation of a NAND gate using CMOS technology (top), and as a standard cell
(lower left), that can be embedded in a VLSI layout (lower right).

Since standard-cell placement has less freedom, the complexity of this design meth- 
odology is greatly reduced. This can decrease time-to-market at the cost of such 
metrics as power efficiency, layout density, or operating frequency, when compared 
to full-custom designs. Hence, standard-cell based designs, e.g., ASICs, address 
different market segments than full-custom designs, e.g., microprocessors, FPGAs, 
and memory products. A substantial initial effort is required to develop the cell 
library and qualify it for manufacturing. 



The routing between standard-cell rows uses either feedthrough (empty) cells within
rows or available routing tracks across rows (Fig. 1.7). When space between cell
rows is available, such regions are called channels. These channels, along with the
space above the cells, can also be used for routing. Over-the-cell (OTC) routing has
become popular as a way to use multiple metal layers - up to 8-12 in modern designs -
in current process technologies. This routing style is more flexible than traditional
channel routing. If OTC routing is used, then adjacent standard-cell rows are
typically not separated by routing channels but instead share either a power rail or a
ground rail (Fig. 1.7 right). OTC routing is prevalent in industry today.

Fig. 1.7 (a) A standard-cell layout with net A-A' routed through a feedthrough cell and cell rows
separated by channels; each row has its own power and ground rails. (b) A standard-cell layout with
net A-A' routed using over-the-cell (OTC) routing; cell rows share power and ground rails, which
requires alternating cell orientations. Designs with more than three metal layers use OTC routing.

Macro cells. Macro cells are typically larger pieces of logic that perform a reusable 
functionality. They can range from simple (a couple of standard cells) to highly 
complex (entire subcircuits reaching the scale of an embedded processor or memory 
block), and can vary greatly with respect to their shapes and sizes. In most cases, 
macro cells can be placed anywhere in the layout area with the goals of optimizing 
routing distance or electrical properties of the design. 

Due to the increasing popularity of reusing optimized modules, macro cells, such as 
adders and multipliers, have become popular. In some cases, almost the entire func- 
tionality of a design can be assembled from pre-existing macros; this calls for top- 
level assembly, through which various subcircuits, e.g., analog blocks, standard-cell 
blocks, and "glue" logic, are combined with individual cells, e.g., buffers, to form 
the highest hierarchical level of a complex circuit (Fig. 1.8). 

Fig. 1.8 Example layout with macro cells.

Gate arrays. Gate arrays are silicon chips with standard logic functionality, e.g.,
NAND and NOR, but no connections. The interconnect (routing) layers are added 
later after the chip-specific requirements are known. Since the gate arrays are not 
initially customized, they can be mass-produced. Then, the time-to-market of gate 
array-based designs is mostly constrained by the fabrication of interconnects. This 
makes gate array-based designs cheaper and faster to produce than standard cell- 
based or macro cell-based designs, particularly for low production volumes. 

The layout of gate arrays is greatly restricted, so as to simplify modeling and design. 
Due to this limited freedom, wire-routing algorithms can be very straightforward. 
Only the following two tasks are needed. 

— Intracell routing: creating a cell (logic block) by, e.g., connecting certain tran- 
sistors to implement a NAND gate. Common gate connections are typically lo- 
cated in cell libraries. 

— Intercell routing: connecting the logic blocks to form nets from the netlist. 

During physical design of gate arrays, (1) cells are selected from what is available 
on the chip, and (2) since the demand for routing resources depends on the place- 
ment configuration, a bad placement may lead to failures at the routing stage. Sev- 
eral variants and extensions of the traditional gate-array style are now available. 

Field-programmable gate arrays (FPGAs). In an FPGA, both logic elements and 
interconnects come prefabricated, but can be configured by users through switches 
(Fig. 1.9). Logic elements (LEs) are implemented by lookup tables (LUTs), each of 
which can represent any k-input Boolean function, e.g., k = 4 or k = 5. Interconnect
is configured using switchboxes (SBs) that join wires in adjacent routing channels.
Configurations of LUTs and switchboxes are read from external storage and stored 
locally in memory cells. The main advantage of FPGAs is their customization with- 
out the involvement of a fabrication facility. This dramatically reduces design costs, 
up-front investment and time-to-market. However, FPGAs typically run much 
slower and dissipate more power than ASICs [1.8]. Above certain production vol- 
umes, e.g., millions of chips, FPGAs become more expensive than ASICs since the 
non-recurring design and manufacturing costs of ASICs are amortized [1.2]. 
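
To make the lookup-table idea concrete, the following minimal sketch models a k-input LUT as a table of 2^k stored bits that is indexed by the input values. The names `make_lut` and `lut_eval` are illustrative assumptions and do not correspond to any FPGA vendor's tools.

```python
from itertools import product

def make_lut(k, func):
    """Build the 2**k-entry truth table of a k-input Boolean function."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

def lut_eval(table, inputs):
    """Evaluate the LUT: the input bits form an index into the stored table."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit  # first input is the most significant bit
    return table[index]

# Configure a 4-input LUT as (a AND b) XOR (c OR d).
table = make_lut(4, lambda a, b, c, d: (a & b) ^ (c | d))
print(len(table))                      # number of configuration bits
print(lut_eval(table, (1, 1, 0, 0)))   # value of (1 AND 1) XOR (0 OR 0)
```

Reconfiguring the logic element amounts to loading a different table, which is why no fabrication step is involved.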

Structured-ASICs (channel-less gate arrays). A channel-less gate array is similar
to an FPGA, except that the cells are usually not configurable. Unlike traditional
gate arrays, sea-of-gates designs have many interconnect layers, removing the need
for routing channels and thus improving density. The interconnects (sometimes, only
via layers) are mask-programmed in the foundry, and are not field-programmable.
The modern incarnation of the channel-less gate array is the structured ASIC.

The gates and interconnects of an integrated circuit are formed using standard materials
that are deposited and patterned on layout layers, with the layout patterns themselves
conforming to design rules that ensure manufacturability, electrical performance
and reliability.

Layout layers. Integrated circuits are made up of several different materials, the 
main ones being 

— Single-crystal silicon substrate, which is doped to enable construction of n- and
p-channel transistors

— Silicon dioxide, which serves as an insulator

— Polycrystalline silicon or polysilicon, which forms transistor gates and can
serve as an interconnect material

— Either aluminum or copper, which serves as metal interconnect

Silicon serves as the diffusion layer. The polysilicon and the aluminum and copper
layers are collectively referred to as interconnect layers; the polysilicon is called
poly and the remaining layers are called Metal1, Metal2, etc. (Fig. 1.10). Vias and
contacts make connections between the different layers - vias connect metal layers
and contacts connect poly and Metal1.

Fig. 1.10 The different layers for a simple inverter cell, showing external connections to the channel
below and internal connections. 

The wire resistance is usually given as sheet resistance in ohms per square (Ω/□).
That is, for a given wire thickness, the resistance per square area remains the same -
independent of the square size (a higher resistance for a longer length is compensated
by the increased width of the square).¹ Hence, the resistance of any rectangular
interconnect shape can be easily calculated as the number of unit-square areas
multiplied by the sheet resistance of the corresponding layer.

Individual transistors are created by overlapping poly and diffusion layers. Cells, 
e.g., standard cells, are comprised of transistors but typically include one metal layer. 

The routing between cells (Chaps. 5-7) is performed entirely within the metal layers.
This is a non-trivial task - not only are poly and Metal1 mostly reserved for cells,
but different layers have varying sheet resistances, which strongly affects timing
characteristics. For a typical 0.35 μm CMOS process, the sheet resistance of poly is
10 Ω/□, that of the diffusion layer is approximately 3 Ω/□, and that of aluminum is
0.06 Ω/□. Thus, poly should be used sparingly, and most of the routing done in
metal layers.

Routing through multiple metal layers requires vias. For the same 0.35 μm process,
the typical resistance of a via between two metal layers is 6 Ω, while that of a contact
is significantly higher - 20 Ω. As technology scales, modern copper interconnects
become highly resistive due to smaller cross sections, grain effects that cause
electron scattering, and the use of barrier materials to prevent reactive copper atoms
from leaching into the rest of the circuit. In a typical 65 nm CMOS process, the
sheet resistance of poly is 12 Ω/□, that of the diffusion layer is 17 Ω/□, and that of
the copper Metal1 layer is 0.16 Ω/□. Via and contact resistances are respectively
1.5 Ω and 22 Ω in a typical 65 nm process.
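
As a rough illustration of why poly is used sparingly, the sketch below totals the resistance of a route as (sheet resistance × number of squares) per segment plus a fixed resistance per via or contact. The 0.35 μm values are the ones quoted in the text; the function name, the 200-square route and the two-contact detour are assumptions made only for this example.

```python
# Values quoted in the text for a typical 0.35 um CMOS process.
SHEET_OHMS = {"poly": 10.0, "diffusion": 3.0, "metal": 0.06}  # ohms per square
VIA_OHMS = 6.0       # via between two metal layers
CONTACT_OHMS = 20.0  # contact between poly and Metal1

def route_resistance(segments, vias=0, contacts=0):
    """Total resistance of a route.

    segments: list of (layer, squares) pairs, where squares = length / width.
    """
    wire = sum(SHEET_OHMS[layer] * squares for layer, squares in segments)
    return wire + vias * VIA_OHMS + contacts * CONTACT_OHMS

# Hypothetical 200-square connection: all in poly, or moved up to metal
# at the price of one contact at each end.
all_poly = route_resistance([("poly", 200)])
via_metal = route_resistance([("metal", 200)], contacts=2)
print(all_poly, via_metal)
```

Even with the two contacts included, the metal route in this example has far lower resistance than the poly route.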

Design rules. An integrated circuit is fabricated by shining laser light through masks,
where each mask defines a certain layer pattern. For a mask to be effective, its layout
pattern must meet specific technology constraints. These constraints, or design
rules, ensure that (1) the design can be fabricated in the specified technology and (2)
the design will be electrically reliable and feasible. Design rules exist both for each
individual layer and for interactions across multiple layers. In particular, transistors
require structural overlaps of poly and diffusion layers.

Though design rules are complex, they can be broadly grouped into three categories. 

— Size rules, such as minimum width: The dimensions of any component (shape), 
e.g., length of a boundary edge or area of the shape, cannot be smaller than 
given minimum values (a in Fig. 1.11). These values vary across different 
metal layers. 



¹ Since [length/width] is dimensionless, sheet resistance is measured in the same units as resistance
(ohms). However, to distinguish it from resistance, it is specified in ohms per square (Ω/□).

— Separation rules, such as minimum separation: Two shapes, either on the same
layer (b in Fig. 1.11) or on adjacent layers (c in Fig. 1.11), must be a minimum
(rectilinear or Euclidean diagonal) distance apart (d in Fig. 1.11).

— Overlap rules, such as minimum overlap: Two connected shapes on adjacent
layers must have a certain amount of overlap (e in Fig. 1.11) due to inaccuracy
of mask alignment to previously-made patterns on the wafer.
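
The three rule categories can be sketched as simple geometric predicates on axis-aligned rectangles. This is a minimal illustration under stated assumptions: rectangles are `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and real design-rule decks contain many more rule types and layer-specific values.

```python
import math

def width_ok(rect, min_width):
    """Size rule: the smaller side of the rectangle must be at least min_width."""
    x1, y1, x2, y2 = rect
    return min(x2 - x1, y2 - y1) >= min_width

def spacing(r, s):
    """Euclidean distance between two rectangles (0 if they touch or overlap)."""
    dx = max(s[0] - r[2], r[0] - s[2], 0)
    dy = max(s[1] - r[3], r[1] - s[3], 0)
    return math.hypot(dx, dy)

def separation_ok(r, s, min_separation):
    """Separation rule: the two shapes must be at least min_separation apart."""
    return spacing(r, s) >= min_separation

def overlap_ok(r, s, min_overlap):
    """Overlap rule: the shapes must overlap by at least min_overlap in x and y."""
    ox = min(r[2], s[2]) - max(r[0], s[0])
    oy = min(r[3], s[3]) - max(r[1], s[1])
    return min(ox, oy) >= min_overlap

a = (0, 0, 4, 2)
b = (7, 6, 10, 8)   # diagonal neighbor of a: gap of 3 in x and 4 in y
print(width_ok(a, 2), spacing(a, b), overlap_ok((0, 0, 4, 4), (2, 2, 6, 6), 2))
```

For diagonal neighbors such as `a` and `b`, the Euclidean distance is larger than either the horizontal or the vertical gap, which is why the rule text distinguishes rectilinear from Euclidean diagonal distance.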

To enable technology scaling, fabrication engineers use a standard unit lambda (λ)
to represent the minimum size of any design feature [1.9]. Thus, design rules are
specified in multiples of λ, which facilitates grid-based layout with base length λ.
Such a framework is very convenient, in that technology scaling only affects the
value of λ. However, as the size of transistors decreases, the λ metric is becoming
less meaningful, as some physical and electrical properties no longer follow such
ideal scaling.



[Figure: layout shapes on a grid of granularity λ, annotated with the following rules.]
Minimum Width: a
Minimum Separation: b, c, d
Minimum Overlap: e



Fig. 1.11 Several classes of design rules. The grid's granularity is λ, the smallest meaningful
technology-dependent unit of length.

1.5 Physical Design Optimizations



Physical design is a complex optimization problem with several different objectives,
such as minimum chip area, minimum wirelength, and minimum number of vias. Common
goals of optimization are to improve circuit performance, reliability, etc. How well
the optimization goals are met determines the quality of the layout.

Different optimization goals (1) may be difficult to capture within algorithms and (2)
may conflict with each other. However, tradeoffs among multiple goals can often be
expressed concisely by an objective function. For instance, wire routing can optimize

$$w_1 \cdot A + w_2 \cdot L$$

where A is the chip area, L is the total wirelength, and $w_1$ and $w_2$ are weights that
represent the relative importance of A and L. In other words, the weights influence
the impact of each objective goal on the overall cost function. In practice, $0 < w_1 < 1$,
$0 < w_2 < 1$, and $w_1 + w_2 = 1$.
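
A weighted objective of this kind can be sketched in a few lines. The function name `layout_cost` and the two candidate layouts are assumptions for illustration; in practice, area and wirelength would typically be normalized to comparable scales before weighting.

```python
def layout_cost(area, wirelength, w1):
    """Weighted sum w1*A + w2*L with w2 = 1 - w1, so that w1 + w2 = 1."""
    if not 0 < w1 < 1:
        raise ValueError("w1 must lie strictly between 0 and 1")
    w2 = 1 - w1
    return w1 * area + w2 * wirelength

# Two hypothetical candidate layouts given as (area, wirelength).
candidates = {"X": (100, 400), "Y": (120, 300)}

def best(w1):
    """Name of the candidate with the lowest weighted cost."""
    return min(candidates, key=lambda name: layout_cost(*candidates[name], w1))

print(best(0.9))  # area-dominated weighting
print(best(0.1))  # wirelength-dominated weighting
```

Changing the weights changes which candidate wins, which is exactly how the weights express the relative importance of the two objectives.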

During layout optimization, three types of constraints must be met. 

— Technology constraints enable fabrication for a specific technology node and 
are derived from technology restrictions. Examples include minimum layout 
widths and spacing values between layout shapes. 

— Electrical constraints ensure the desired electrical behavior of the design. Examples
include meeting maximum timing constraints for signal delay and staying
below maximum coupling capacitances.

— Geometry (design methodology) constraints are introduced to reduce the overall
complexity of the design process. Examples include the use of preferred
wiring directions during routing, and the placement of standard cells in rows.

As technology scales further, electrical effects have become increasingly significant. 
Thus, many types of electrical constraints have been introduced recently to ensure 
correct behavior. Various constraints not required at earlier technology nodes are 
necessary for modern designs. Such constraints may limit coupling capacitance to
ensure signal integrity, prevent electromigration effects in interconnects, and pre- 
vent adverse temperature-related phenomena. 

A basic challenge is that new electrical effects are not easily translated into new 
geometric rules for layout design. For instance, is signal delay best minimized by 
reducing total wirelength or by reducing coupling capacitance between the routes of 
different nets? Such a question is further complicated by the fact that routes on other 
metal layers, as well as their switching activity, also affect signal delay. Although 
only loose geometric rules can be defined, electrical properties can be accurately 
extracted from layout, and physical simulation enables precise estimation of timing, 
noise and power. This allows designers to assess the impact of layout optimizations. 

In summary, difficulties encountered when optimizing layout include the following. 

— Optimization goals may conflict with each other. For example, minimizing 
wirelength too aggressively can result in a congested region, and increase the 
number of vias. 

— Constraints often lead to discontinuous, qualitative effects even when objective 
functions remain continuous. For example, the floorplan design might permit 
only some of the bits of a 64-bit bus to be routed with short wires, while the 
remaining bits must be detoured. 

— Constraints, due to scaling and increased interconnect requirements, are tight- 
ening, with new constraint types added for each new technology node. 

These difficulties motivate the following rules of thumb.

— Each design style requires its own custom flow. That is, there is no universal 
EDA tool that supports all design styles. 

— When designing a chip, imposing geometric constraints can potentially make 
the problem easier at the expense of layout optimization. For instance, a row- 
based standard-cell design is much easier to implement than a full-custom lay- 
out, but the latter could achieve significantly better electrical characteristics. 

— To further reduce complexity, the design process is divided into sequential 
steps. For example, placement and routing are performed separately, each with 
specific optimization goals and constraints that are evaluated independently. 

— When performing fundamental optimizations, the choice is often between (1)
an abstract model of circuit performance that admits a simple computation, or
(2) a realistic model that is computationally intractable. When no efficient algorithm
or closed-form expression is available to obtain a globally optimal solution,
the use of heuristics is a valid and effective option (Sec. 1.6).

A key criterion for assessing any algorithm is its runtime complexity, the time required
by the algorithm to complete as a function of some natural measure of the
problem size. For example, in block placement, a natural measure of problem size is
the number of blocks to be placed, and the time $t(n)$ needed to place n blocks can be
expressed as

$$t(n) = f(n) + c$$

where $f(0) = 0$ and c is a fixed amount of "overhead" that is required independently
of input size, e.g., during initialization.

While other measures of algorithm complexity such as memory ("space") are also of
interest, runtime is the most important complexity metric for IC physical design
algorithms. Complexity is represented in an asymptotic sense, with respect to the
input size n, using big-Oh notation or $O(\ldots)$. Formally, the runtime $t(n)$ is order $f(n)$,
written as $t(n) = O(f(n))$, when there exist constants $k > 0$ and $n_0$ such that

$$t(n) \le k \cdot f(n) \quad \text{for all } n \ge n_0.$$

Placement problems and their associated computational complexities include

— Place n cells in a single row and return the wirelength: $O(n)$

— Given a single-row placement of n cells, determine whether the wirelength can
be improved by swapping one pair of cells: $O(n^2)$

— Given a single-row placement of n cells, determine whether the wirelength can
be improved by permuting a group of three cells at a time: $O(n^3)$

— Place n cells in a single row so as to minimize the wirelength: $O(n! \cdot n)$ with a
naive algorithm
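
The first two tasks can be sketched as follows, assuming that each cell occupies one unit-width slot and that the wirelength of a net is the distance between its leftmost and rightmost cells; both the model and the function names are assumptions for illustration. The number of candidate swaps is n(n-1)/2, i.e., quadratic in n. For simplicity this sketch recomputes the full wirelength for every swap, which costs more than an incremental update would.

```python
from itertools import combinations

def wirelength(order, nets):
    """Total wirelength of a single-row placement (span of each net in slots)."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)

def improving_swap(order, nets):
    """Return the first pair of slot indices whose swap reduces wirelength, else None."""
    base = wirelength(order, nets)
    for i, j in combinations(range(len(order)), 2):
        trial = list(order)
        trial[i], trial[j] = trial[j], trial[i]
        if wirelength(trial, nets) < base:
            return i, j
    return None

nets = [("a", "c"), ("b", "d")]
print(wirelength(["a", "b", "c", "d"], nets))
print(improving_swap(["a", "b", "c", "d"], nets))
print(improving_swap(["a", "c", "b", "d"], nets))
```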

Example: Exhaustively Enumerating All Placement Possibilities 

Given: n cells. 

Task: find a linear (single-row) placement of n cells with minimum total wirelength by using
exhaustive enumeration.

Solution: 

The solution space consists of n! placement options. If generating and evaluating the wirelength
of each possible placement solution takes 1 microsecond (μs) and n = 20, the total time
needed to find an optimal solution would be 77,147 years!
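
The example can be reproduced with a short sketch. It assumes the same simplified wirelength model as before (unit-width slots, net length equal to the span between leftmost and rightmost cell) and a 365-day year; the function names are illustrative. Exhaustive enumeration is only usable for very small n.

```python
from itertools import permutations
from math import factorial

def wirelength(order, nets):
    """Total wirelength of a single-row placement (span of each net in slots)."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)

def best_placement(cells, nets):
    """Try all n! orderings and return one with minimum wirelength."""
    return min(permutations(cells), key=lambda order: wirelength(order, nets))

def enumeration_years(n, seconds_per_placement=1e-6):
    """Time to evaluate all n! placements, in 365-day years."""
    return factorial(n) * seconds_per_placement / (365 * 24 * 3600)

nets = [("a", "c"), ("b", "d")]
print(best_placement(["a", "b", "c", "d"], nets))
print(round(enumeration_years(20)))
```

Going from n = 4 to n = 20 changes the search space from 24 orderings to roughly 2.4 × 10^18, which is what makes the naive approach impractical.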

The first three placement tasks are considered scalable, since their complexities can
be written as $O(n^p)$ or $O(n^p \log n)$, where p is a small integer, usually $p \in \{1,2,3\}$.
Algorithms having complexities where p > 3 are often considered not scalable.
Furthermore, the last problem is considerably more difficult and is impractical for
even moderate values of n, despite the existence of clever algorithms. A number of
important problems have best-known algorithm complexities that grow exponentially
with n, e.g., $O(n!)$, $O(n^n)$, and $O(e^n)$. Many of these problems are known to be
NP-hard,² and no polynomial-time algorithms are currently known that solve these
problems. Thus, for such problems, no known algorithms can ensure, in a
time-efficient manner, that they will return a globally optimal solution.

Chaps. 2-8 all deal with physical design problems that are NP-hard. For these prob- 
lems, heuristic algorithms are used to find near-optimal solutions within practical 
runtime limits. In contrast to conventional algorithms, which are guaranteed to pro- 
duce an optimal (valid) solution in a known amount of time, heuristics may produce 
inferior solutions. Algorithms that have poor worst-case complexity, but produce 
optimal solutions in all practical cases, are also considered heuristics. The primary 
goal of algorithm development for EDA is to construct heuristics that can quickly 



² NP stands for non-deterministic polynomial time, and refers to the ability to validate in polynomial
time any solution that was "non-deterministically guessed". NP-hard problems are at least
as hard as the most difficult NP problems. For further reading, see [1.4] and [1.7].
produce near-optimal solutions on large and complex commercial designs by exploiting
the typical features of such designs. Such heuristics often incorporate conventional
algorithms for subtasks that can be solved optimally.

Types of algorithms. Many physical design problems, e.g., placement and routing, 
are NP-hard, so solving them with optimal, worst-case polynomial-time algorithms 
is unlikely. Many heuristics have been developed for these problems, the quality of 
which can be assessed based on (1) runtime and (2) solution quality, measured by 
suboptimality (difference from optimal solutions). 

Heuristic algorithms can be classified as 

— Deterministic: All decisions made by the algorithm are repeatable, i.e., not 
random. One example of a deterministic heuristic is Dijkstra's shortest path al- 
gorithm (Sec. 5.6.3). 

— Stochastic: Some decisions made by the algorithm are made randomly, e.g., 
using a pseudo-random number generator. Thus, two independent runs of the 
algorithm will produce two different solutions with high probability. One ex- 
ample of a stochastic algorithm is simulated annealing (Sec. 3.5.3). 

In terms of structure, a heuristic algorithm can be 

— Constructive: The heuristic starts with an initial, incomplete (partial) solution 
and adds components until a complete solution is obtained. 

— Iterative: The heuristic starts with a complete solution and repeatedly improves 
the current solution until a preset termination criterion is reached. 

Physical design algorithms often employ both constructive and iterative heuristics. 
For instance, a constructive heuristic can be used to first generate an initial solution, 
which is then refined by an iterative algorithm (Fig. 1.12). 



[Flowchart: Problem Instance, Initial Solution, Iterative Improvement, Return Best-Seen Solution]

Fig. 1.12 Example of a heuristic that has both constructive and iterative stages. The constructive
stage creates an initial solution that is refined by the iterative stage.
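
A two-stage heuristic of this kind might look as follows for single-row placement. This is a sketch under stated assumptions, not a method from the text: the constructive stage appends, one at a time, the cell that keeps the partial wirelength smallest, and the iterative stage keeps applying pair swaps that strictly reduce wirelength (a greedy acceptance rule) until none remains. All function names and the example netlist are hypothetical.

```python
from itertools import combinations

def wirelength(order, nets):
    """Total wirelength of a single-row placement (span of each net in slots)."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)

def construct(cells, nets):
    """Constructive stage: grow a partial solution until all cells are placed."""
    order, remaining = [], list(cells)
    while remaining:
        def partial_cost(cell):
            trial = order + [cell]
            placed = set(trial)
            # Only the already-placed pins of each net contribute.
            sub = [tuple(c for c in net if c in placed) for net in nets]
            return wirelength(trial, [net for net in sub if len(net) > 1])
        nxt = min(remaining, key=partial_cost)
        order.append(nxt)
        remaining.remove(nxt)
    return order

def improve(order, nets):
    """Iterative stage: accept improving pair swaps until no swap helps."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            before = wirelength(order, nets)
            order[i], order[j] = order[j], order[i]
            if wirelength(order, nets) < before:
                improved = True
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
    return order

cells = ["a", "b", "c", "d", "e"]
nets = [("a", "e"), ("b", "d"), ("a", "c"), ("c", "e")]
initial = construct(cells, nets)
final = improve(initial, nets)
print(initial, wirelength(initial, nets))
print(final, wirelength(final, nets))
```

The iterative stage terminates because every accepted swap strictly decreases an integer cost that cannot go below zero; the result is a local optimum with respect to pair swaps, not necessarily a global one.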



When the solution space is represented by a graph structure consisting of nodes and 
edges (Sec. 1.7), algorithms for search and optimization can be classified in another 
way, as described below. The graph structure can be explicit, as in wire routing, or
implicit, with edges representing small differences between possible solutions, e.g., 
swapping a pair of adjacent standard cells in placement. 

— Breadth-first search (BFS): When searching for goal node T from starting node
$S_0$, the algorithm checks all adjacent nodes $S_1$. If the goal T is not found in $S_1$,
the algorithm searches all of $S_1$'s adjacent nodes $S_2$. This process continues,
resembling expansion of a "wave-front", until T is found or all nodes have been
searched.

— Depth-first search (DFS): From the starting node $S_0$, the algorithm checks
nodes in order of increasing depth, i.e., traversing as far as possible and as soon
as possible. In contrast to BFS, the next-searched node $S_{i+1}$ is a neighbor of $S_i$,
unless all neighbors of $S_i$ have already been searched, in which case the search
backtracks to the highest-index location that has an unsearched neighbor. Thus,
DFS traverses as far as possible as soon as possible.

— Best-first search: The direction of search is based on cost criteria, not simply on
adjacency. Every step taken considers a current cost as well as the remaining
cost to the goal. The algorithm always expands or grows from the current best
known node or solution. An example is Dijkstra's algorithm (Sec. 5.6.3).
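
The three search strategies can be compared on a small explicit graph. The sketch below is illustrative only: the graph, its edge weights and the function names are assumptions, and the Dijkstra variant shown orders nodes by the cost accumulated from the start node.

```python
import heapq
from collections import deque

def bfs(adj, start, goal):
    """Visit nodes level by level (wave-front); return the visiting order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        if node == goal:
            break
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

def dfs(adj, start, goal):
    """Go as deep as possible first, backtracking when stuck; return the visiting order."""
    seen, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        if node == goal:
            break
        stack.extend(reversed(adj[node]))  # keep the listed neighbor order
    return order

def dijkstra(wadj, start, goal):
    """Always expand the cheapest known node; return the path cost or None."""
    dist, heap = {start: 0}, [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nb, w in wadj[node]:
            nd = d + w
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return None

adj = {"s": ["a", "b"], "a": ["c"], "b": ["t"], "c": [], "t": []}
wadj = {"s": [("a", 1), ("b", 5)], "a": [("c", 1)], "b": [("t", 1)],
        "c": [("t", 7)], "t": []}
print(bfs(adj, "s", "t"))
print(dfs(adj, "s", "t"))
print(dijkstra(wadj, "s", "t"))
```

On this graph BFS reaches both neighbors of `s` before going deeper, whereas DFS follows the branch through `a` to its end before backtracking to `b`.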

Finally, some algorithms used in physical design are greedy. An initial solution is 
transformed into another solution only if the new solution is strictly better than the 
previous solution. Such algorithms find locally optimal solutions. For further read- 
ing on the theory of algorithms and complexity, see [1.4]. 

Solution quality. Given that most physical design algorithms are heuristic in nature, 
the assessment of solution quality is difficult. If the optimal solution is known, then 
the heuristic solution can be judged by its suboptimality $\varepsilon$ with respect to the optimal
solution

$$\varepsilon = \frac{|\mathit{cost}(S_H) - \mathit{cost}(S_{opt})|}{\mathit{cost}(S_{opt})}$$

where $\mathit{cost}(S_H)$ is the cost of the heuristic solution $S_H$ and $\mathit{cost}(S_{opt})$ is the cost of the
optimal solution $S_{opt}$. This notion applies to only a tiny fraction of design problems,
in that optimal solutions are known only for small (or artificially-created) instances. 
On the other hand, bounds on suboptimality can sometimes be proven for particular 
heuristics, and can provide useful guidance. 

When finding an optimal solution is impractical, as is typical for modern designs,
heuristic solutions are tested across a suite of benchmarks. These sets of (non-trivial) 
problem instances represent different corner cases, as well as common cases, and are 
inspired by either industry or academic research. They enable assessment of a given 
heuristic's scalability and solution quality against previously-obtained heuristic 
solutions. 



1.7 Graph Theory Terminology 



Graphs are heavily used in physical design algorithms to describe and represent 
layout topologies. Thus, a basic understanding of graph theory terminology is vital 
to understanding how the optimization algorithms work. The following is a list of 
basic terms; subsequent chapters will introduce specialized terminology. 

A graph G(V,E) is made up of two sets - the set of nodes or vertices (elements),
denoted as V, and the set of edges (relations between the elements), denoted as E
(Fig. 1.13(a)). The degree of a node is the number of its incident edges.

A hypergraph consists of nodes and hyperedges, with each hyperedge being a subset 
of two or more nodes. Note that a graph is a hypergraph in which every hyperedge 
has cardinality two. Hyperedges are commonly used to represent multi-pin nets or 
multi-point connections within circuit hypergraphs (Fig. 1.13(b)).

A multigraph (Fig. 1.13(c)) is a graph that can have more than one edge between 
two given nodes. Multigraphs can be used to represent varying net weights; an alter- 
native is to use an edge-weighted graph representation, which is more compact and 
supports non-integer weights. 








Fig. 1.13 (a) A graph with seven edges. (b) A hypergraph with three hyperedges having sizes four,
three and two respectively. (c) A multigraph with four edges, where a-b has weight = 3.
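
These three structures map directly onto simple containers. The sketch below is one possible representation, with hypothetical node names and edges that are not taken from Fig. 1.13: a graph as a set of two-node edges, a hypergraph as a list of node subsets, and a multigraph either as repeated edges or, more compactly, as edge weights.

```python
from collections import Counter

# Graph G(V, E): nodes plus two-node edges (a frozenset is an unordered pair).
V = {"a", "b", "c", "d"}
E = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}

def degree(node, edges):
    """Number of edges incident to the node."""
    return sum(1 for e in edges if node in e)

# Hypergraph: each hyperedge is a subset of two or more nodes, e.g., a multi-pin net.
hyperedges = [frozenset("abcd"), frozenset("bcd"), frozenset("ad")]

def is_ordinary_graph(hyperedges):
    """A hypergraph is a graph when every hyperedge has cardinality two."""
    return all(len(h) == 2 for h in hyperedges)

# Multigraph: parallel edges stored as repeated entries ...
multi_edges = [("a", "b"), ("a", "b"), ("a", "b"), ("b", "c")]
# ... or as an edge-weighted graph, one entry per node pair.
weighted = Counter(multi_edges)

print(degree("c", E), is_ordinary_graph(hyperedges), weighted[("a", "b")])
```

The weighted form stores each node pair once, and a plain dictionary in its place would also allow non-integer weights.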

A path between two nodes is an ordered sequence of edges from the start node to the 
end node (a-b-f-g-e in Fig. 1.13(a)).

A cycle (loop) is a closed path that starts and ends at the same node (c-f-g-e-d-c in 
Fig. 1.13(a)). 



An undirected graph is a graph that represents only unordered node relations and 
does not have any directed edges. A directed graph is a graph where the direction of 
the edge denotes a specific ordered relation between two nodes. For example, a 
signal might be generated at the output pin of one gate and flow to an input pin of 
another gate - but not the other way around. Directed edges are drawn as arrows 
starting from one node and pointing to the other. 

A directed graph is cyclic if it has at least one directed cycle (c-f-g-d-c in Fig. 1.14(a) 
or a-b-a in Fig. 1.14(b)). Otherwise, it is acyclic. Several important EDA algorithms 
operate on directed acyclic graph (DAG) representations of design data (Fig. 
1.14(c)). 
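Whether a directed graph is acyclic can be tested by repeatedly removing nodes that have no incoming edges (Kahn's topological-sort procedure). The sketch below is one common way to do this, not a method prescribed by the text; the function name `is_acyclic` and the small example graphs are assumptions made for illustration.

```python
from collections import deque

def is_acyclic(nodes, edges):
    """Return True if the directed graph has no directed cycle (Kahn's algorithm)."""
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    # Start from all nodes that have no incoming edge.
    queue = deque(v for v in nodes if indegree[v] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Nodes on a directed cycle never reach indegree 0 and are never visited.
    return visited == len(nodes)

cycle_edges = [("a", "b"), ("b", "a")]                        # two-node cycle a-b-a
dag_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]  # hypothetical DAG

print(is_acyclic(["a", "b"], cycle_edges))          # False
print(is_acyclic(["a", "b", "c", "d"], dag_edges))  # True
```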





Fig. 1.14 (a) A directed graph with a cycle c-f-g-d-c. (b) A directed graph with a cycle a-b-a. (c) A 
directed acyclic graph (DAG). 

A complete graph is a graph of n nodes with 

$\binom{n}{2} = \frac{n!}{2!\,(n-2)!} = \frac{n(n-1)}{2}$

edges, one edge between each pair of nodes, i.e., each node is connected by an edge 
to every other node. 

A connected graph is a graph with at least one path between each pair of nodes. 

A tree is a connected graph whose n nodes are joined by n - 1 edges. There are two types 
of trees: undirected and directed (trees that have a root). Both types are shown in Fig. 
1.15. In an undirected tree, any node with only one incident edge is a leaf. In a di- 
rected tree, the root has no incoming edge, and a leaf is a node with no outgoing 
edge. In a directed tree, there exists a unique path from the root to any other node. 



Fig. 1.15 (a) An undirected tree with leaves a, e and g, and maximum node degree 3. (b) A directed 
tree with root a and leaves b and e-k. 



A spanning tree in a graph G(V,E) is a connected, acyclic subgraph G' contained 
within G that includes (spans) every node v ∈ V. 

A minimum spanning tree (MST) is a spanning tree with the smallest possible sum 
of edge costs (i.e., edge lengths). 

A rectilinear minimum spanning tree (RMST) is an MST where all edge lengths 
correspond to the Manhattan (rectilinear) distance metric (Fig. 1.16(a)). 
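As a minimal sketch, an RMST can be built with Prim's algorithm on the complete graph whose edge costs are Manhattan distances. The pin coordinates below are hypothetical (the coordinates used in Fig. 1.16 are not given in the text), and the function names are illustrative only.

```python
def manhattan(p, q):
    """Rectilinear distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def rmst(points):
    """Prim's algorithm with Manhattan edge costs; returns (edges, total cost)."""
    names = list(points)
    in_tree = {names[0]}
    edges, total = [], 0
    while len(in_tree) < len(names):
        # Cheapest edge from a node inside the tree to a node outside it.
        cost, u, v = min(
            (manhattan(points[u], points[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        in_tree.add(v)
        edges.append((u, v))
        total += cost
    return edges, total

pins = {"a": (0, 0), "b": (4, 3), "c": (1, 5)}  # hypothetical coordinates
tree_edges, tree_cost = rmst(pins)
print(tree_edges, tree_cost)  # [('a', 'c'), ('c', 'b')] 11
```

This quadratic-per-step search is chosen for brevity; a priority queue would be the usual choice for larger point sets.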

Fig. 1.16 (a) Rectilinear minimum spanning tree (RMST) connecting points a-c with tree cost = 11. 
(b) Rectilinear Steiner minimum tree (RSMT) connecting points a-c with tree cost = 9. 



The Steiner tree, named after J. Steiner (1796-1863), is a generalization of the 
spanning tree. In addition to the original nodes, Steiner trees have Steiner points 
(Fig. 1.16(b)). Edges are allowed to connect to these points as well as to the original 
nodes. The incorporation of Steiner points can reduce the total edge cost of the tree 
to below that of an RMST. A Steiner minimum tree (SMT) has minimum total edge 
cost over all Steiner trees. If constructed in the Manhattan plane using only horizontal 
and vertical segments, the SMT is a rectilinear Steiner minimum tree (RSMT). 

1.8 Common EDA Terminology 



The following is a brief list of important and common terms used in EDA. Many of 
these terms will be discussed in greater detail in subsequent chapters. 

Logic design is the process of mapping the HDL (typically, register-transfer level) 
description to circuit gates and their connections at the netlist level. The result is 
usually a netlist of cells or other basic circuit components and connections. 



Physical design is the process of determining the geometrical arrangement of cells 
(or other circuit components) and their connections within the IC layout. The cells' 
electrical and physical properties are obtained from library files and technology 
information. The connection topology is obtained from the netlist. The result of 
physical design is a geometrically and functionally correct representation in a stan- 
dard file format such as GDSII Stream. 

Physical verification checks the correctness of the layout design. This includes 
verifying that the layout 

— Complies with all technology requirements - Design Rule Checking (DRC) 

— Is consistent with the original netlist - Layout vs. Schematic (LVS) 

— Has no antenna effects - Antenna Rule Checking 

— Complies with all electrical requirements - Electrical Rule Checking (ERC) 

A component is a basic functional element of a circuit. Examples include transistors, 
resistors, and capacitors. 

A module is a circuit partition or a grouped collection of components. 

A block is a module with a shape, i.e., a circuit partition with fixed dimensions. 

A cell is a logical or functional unit built from various components. In digital cir- 
cuits, cells commonly refer to gates, e.g., INV, AND-OR-INVERTER (AOI), 
NAND, NOR. In general, the term is used to refer to either standard cells or macros. 

A standard cell is a cell with a pre-determined functionality. Its height is a multiple 
of a library-specific fixed dimension. In the standard-cell methodology, the logic 
design is implemented with standard cells that are arranged in rows. 

A macro cell is a cell without pre-defined dimensions. This term may also refer to a 
large physical layout, possibly containing millions of transistors, e.g., an SRAM or 
CPU core, and possibly having discrete dimensions, that can be incorporated into the 
IC physical design. 

A pin is an electrical terminal used to connect a given component to its external 
environment. At the level of block-to-block connections (internal to the IC), I/O pins 
are present on lower-level metal layers such as Metal1, Metal2 and Metal3. A pad is 
an electrical terminal used to connect externally to the IC. Often, bond pads are 
present on topmost metal layers and interface between external connections (such as 
to other chips) and internal connections. 

A layer is a manufacturing process level in which design components are patterned. 
During physical design, circuit components are assigned to different layers, e.g., 
transistors are assigned to poly and active layers, while interconnects are assigned to 
poly and metal layers and are routed according to the netlist. 

A contact is a direct connection between silicon (poly or another active level) and a 
metal layer, typically Metal1. Contacts are often used inside cells. 

A via is a connection between metal layers, usually to connect routing structures on 
different layers. 

A net or signal is a set of pins or terminals that must be connected to have the same 
potential. 

Supply nets are power (VDD) and ground (GND) nets that provide current to cells. 

A netlist is the collection of all signal nets and the components that they connect in a 
design, or, a list of all the nets and connecting pins of a subsection of the design. 
That is, netlists can be organized as (1) pin-oriented - each design component has a 
list of associated nets (Fig. 1.17 center), or (2) net-oriented - each net has a list of 
associated design components (Fig. 1.17 right). Netlists are created during logic 
synthesis and are a key input to physical design. 
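The two netlist organizations carry the same information and can be converted into each other. The sketch below assumes a small hypothetical netlist (component, pin and net names are invented for illustration and do not reproduce Fig. 1.17) stored as Python dictionaries.

```python
from collections import defaultdict

# Hypothetical net-oriented netlist: net -> list of (component, pin) terminals.
net_oriented = {
    "N1": [("a", "out"), ("x", "in1")],
    "N2": [("b", "out"), ("x", "in2"), ("y", "in1")],
    "N3": [("x", "out"), ("y", "in2")],
}

def to_pin_oriented(nets):
    """Net-oriented -> pin-oriented: component -> {pin: net}."""
    components = defaultdict(dict)
    for net, terminals in nets.items():
        for component, pin in terminals:
            components[component][pin] = net
    return dict(components)

def to_net_oriented(components):
    """Pin-oriented -> net-oriented: net -> sorted list of (component, pin)."""
    nets = defaultdict(list)
    for component, pins in components.items():
        for pin, net in pins.items():
            nets[net].append((component, pin))
    return {net: sorted(terminals) for net, terminals in nets.items()}

pin_oriented = to_pin_oriented(net_oriented)
print(pin_oriented["x"])  # {'in1': 'N1', 'in2': 'N2', 'out': 'N3'}
```

Which form is preferable depends on the queries an algorithm makes: per-cell lookups favor the pin-oriented form, per-net traversals the net-oriented one.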

Fig. 1.17 Pin-oriented (center) and net-oriented (right) netlist examples for the sample circuit (left). 

A net weight w(net) is a numerical (typically integer) value given to a net net (or 
edge edge) to indicate its importance or criticality. Net weights are used primarily 
during placement, e.g., to minimize distance between cells that are connected by 
edges with high net weights, and routing, e.g., to set the priority of a net. 

The connectivity degree or connection cost c(i,j) between cells i and j for 
unweighted nets is the number of connections between i and j. With weighted nets, 
c(i,j) is the sum of the individual connection weights between i and j. 

The connectivity c(i) of cell i is given by 

$c(i) = \sum_{j=1}^{|V|} c(i,j)$

where |V| is the number of cells in the netlist, and c(i,j) is the connectivity degree 
between cells i and j. For example, cell y in Fig. 1.18 has c(y) = 5 if each net's 
weight equals 1. 



A connectivity graph is a representation of the netlist as a graph. Cells, blocks and 
pads correspond to nodes, while their connections correspond to edges (Fig. 1.18). A 
p-pin net is represented by $\binom{p}{2}$ total connections between its nodes. Multiple edges 
between two nodes imply a stronger (weighted) connection. 

Fig. 1.18 Connectivity graph of the circuit in Fig. 1.17. 



The connectivity matrix C is an n × n matrix that represents the circuit connectivity 
over n cells. Each element C[i][j] represents the connection cost c(i,j) between cells i 
and j (Fig. 1.19). Since C is symmetric, C[i][j] = C[j][i] for 1 ≤ i,j ≤ n. By definition, 
C[i][i] = 0 for 1 ≤ i ≤ n, since a cell i does not meaningfully connect with itself. 
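A connectivity matrix can be derived from a net-oriented netlist by letting every p-pin net contribute one connection to each of its cell pairs. The following sketch assumes a hypothetical four-cell netlist and unit net weights unless a `weights` dictionary is supplied; none of these names come from the text. The connectivity c(i) of a cell is then the sum of its row.

```python
from itertools import combinations

def connectivity_matrix(cells, nets, weights=None):
    """Build the symmetric matrix C; each net adds its weight to every cell pair."""
    index = {cell: k for k, cell in enumerate(cells)}
    n = len(cells)
    C = [[0] * n for _ in range(n)]
    for name, members in nets.items():
        w = 1 if weights is None else weights.get(name, 1)
        for u, v in combinations(sorted(set(members)), 2):
            i, j = index[u], index[v]
            C[i][j] += w
            C[j][i] += w  # keep the matrix symmetric; the diagonal stays 0
    return C

cells = ["a", "b", "x", "y"]
nets = {"N1": ["a", "x", "y"], "N2": ["b", "x", "y"], "N3": ["x", "y"]}  # hypothetical

C = connectivity_matrix(cells, nets)
connectivity = {cell: sum(C[k]) for k, cell in enumerate(cells)}
print(C[2][3])       # 3: nets N1, N2 and N3 each connect x and y
print(connectivity)  # {'a': 2, 'b': 2, 'x': 5, 'y': 5}
```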

Fig. 1.19 The connectivity matrix of the circuit 
from Fig. 1.17. The entry C[x][y] = C[y][x] = 2 
because both N1 and N2 contribute one unit. 



The Euclidean distance metric between two points P1 (x1,y1) and P2 (x2,y2) 
corresponds to the length of the line segment between P1 and P2 (Fig. 1.20). In the 
coordinate plane, the Euclidean distance is 

$d_E(P_1,P_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$



The Manhattan distance metric between two points P1 (x1,y1) and P2 (x2,y2) is the 
sum of the horizontal and vertical displacements between P1 and P2 (Fig. 1.20). In 
the coordinate plane, the Manhattan distance is 

$d_M(P_1,P_2) = |x_2 - x_1| + |y_2 - y_1|$

In Fig. 1.20, the Euclidean distance between P1 and P2 is d_E(P1,P2) = 5, whereas the 
Manhattan distance between P1 and P2 is d_M(P1,P2) = 7. 
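Both metrics are easy to evaluate numerically. The coordinates below are an assumption chosen so that the horizontal and vertical displacements are 3 and 4; the actual coordinates of P1 and P2 in Fig. 1.20 are not stated in the text.

```python
import math

def euclidean(p1, p2):
    """Length of the straight line segment between p1 and p2."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

def manhattan(p1, p2):
    """Sum of horizontal and vertical displacements between p1 and p2."""
    return abs(p2[0] - p1[0]) + abs(p2[1] - p1[1])

P1, P2 = (1, 1), (4, 5)  # assumed coordinates: dx = 3, dy = 4

print(euclidean(P1, P2))  # 5.0
print(manhattan(P1, P2))  # 7
```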




Fig. 1.20 Distance between two points P1 and 
P2 according to Euclidean (d_E) and Manhattan 
(d_M) distance metrics. 

2 Netlist and System Partitioning 



The design complexity of modern integrated circuits has reached unprecedented 
scale, making full-chip layout, FPGA-based emulation and other important tasks 
increasingly difficult. A common strategy is to partition or divide the design into 
smaller portions, each of which can be processed with some degree of independence 
and parallelism. A divide-and-conquer strategy for chip design can be implemented 
by laying out each block individually and reassembling the results as geometric 
partitions. Historically, this strategy was used for manual partitioning, but became 
infeasible for large netlists. Instead, manual partitioning can be performed in the 
context of system-level modules by viewing them as single entities, in cases where 
hierarchical information is available. In contrast, automated netlist partitioning (Secs. 
2.1-2.4) can handle large netlists and can redefine a physical hierarchy of an 
electronic system, ranging from boards to chips and from chips to blocks. Traditional 
netlist partitioning can be extended to multilevel partitioning (Sec. 2.5), which can 
be used to handle large-scale circuits and system partitioning on FPGAs (Sec. 2.6). 



2.1 Introduction 



A popular approach to decrease the design complexity of modern integrated circuits 
is to partition them into smaller modules. These modules can range from a small set 
of electrical components to fully functional integrated circuits (ICs). The partitioner 
divides the circuit into several subcircuits (partitions or blocks) while minimizing 
the number of connections between partitions, subject to design constraints such as 
maximum partition sizes and maximum path delay. 

If each block is implemented independently, i.e., without considering other 
partitions, then connections between these partitions may degrade overall design 
performance, e.g., by increasing circuit delay or decreasing reliability. Moreover, 
a large number of connections between partitions may introduce inter-block 
dependencies that hamper design productivity.¹ Therefore, the primary goal of 
partitioning is to divide the circuit such that the number of connections between 
subcircuits is minimized (Fig. 2.1). Each partition must also meet all design constraints. 
For example, the amount of logic in a partition can be limited by the size of an 
FPGA chip. The number of external connections of a partition may also be limited, 
e.g., by the number of I/O pins in the chip package. 



¹ The empirical observation known as Rent's rule suggests a power-law relationship between the 
number of cells n_G and the number of external connections n_P = t · n_G^r, for any subcircuit of a 
"well-designed" system. Here, t is the number of pins per cell and r, referred to as the Rent's exponent 
or the Rent parameter, is a constant < 1. In particular, Rent's rule quantifies the prevalence of short 
wires in ICs, which is consistent with a hierarchical organization. 



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6_2, © Springer Science+Business Media B.V. 2011 

Fig. 2.1 Two different partitions induced by cuts cut1 and cut2, producing two and four external 
connections, respectively. 

2.2 Terminology 



The following are common terms relevant to netlist partitioning. Terminology relating 
to specific algorithms, such as the Kernighan-Lin algorithm (Sec. 2.4.1), will be 
introduced and defined in their respective sections. 

A cell is any logical or functional unit built from components. 

A partition or block is a grouped collection of components and cells. 

The k-way partitioning problem seeks to divide a circuit into k partitions. Fig. 2.2 
illustrates how the partitioning problem can be abstracted using a graph representa- 
tion, where nodes represent cells, and edges represent connections between cells. 




Fig. 2.2 (a) Sample circuit. (b) Possible graph representation. (c) Possible hypergraph representation. 

Given a graph G(V,E), for each node v ∈ V, area(v) represents the area of the 
corresponding cell or module. For each edge e ∈ E, w(e) represents the priority or weight, 
e.g., timing criticality, of the corresponding edge. 

Though this chapter discusses the partitioning problem and partitioning algorithms 
within the graph context, logic circuits are more accurately represented using 
hypergraphs, where each hyperedge² connects two or more cells. Many graph-based 
algorithms can be directly extended to hypergraphs. 

The set of all partitions |Part| is disjoint if each node v ∈ V is assigned to exactly one 
of the partitions. 

An edge between two nodes i and j is cut if i and j belong to different partitions A 
and B, i.e., i ∈ A, j ∈ B, and (i,j) ∈ E (Fig. 2.3). 

A cut set Ψ is the collection of all cut edges. 



Fig. 2.3 A 2-way partitioning of the circuit in Fig. 2.2. 
A contains nodes a, b and f. B contains nodes c, d, e 
and g. Edges (a,c), (b,c) and (e,f) are cut. Edges (c,e), 
(c^), (d,e) and (e,g) are not cut. 

The most common partitioning objective is to minimize the number or total weight 
of cut edges while balancing the sizes of the partitions. If Ψ denotes the set of cut 
edges, the minimization objective is 

$\sum_{e \in \Psi} w(e)$



Often, partition area is limited due to packing considerations and other boundary 
conditions implied by system hierarchy, chip size, or floorplan restrictions. For any 
subset of nodes V' ⊆ V, let area(V') be the total area of all cells represented by the 
nodes of V'. Bounded-size partitioning enforces an upper bound UB on the total area 
of each partition V_i. That is, area(V_i) ≤ UB_i, where V_i ⊆ V, i = 1, ..., k, and k is the 
number of partitions. Often, a circuit must be divided evenly, with 

$area(V_i) = \sum_{v \in V_i} area(v) \le \frac{1}{k} \sum_{v \in V} area(v) = \frac{1}{k}\, area(V)$
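The objective and the size bound can both be evaluated directly from a partition assignment. The sketch below uses a hypothetical six-node weighted graph; the function names `cut_weight` and `is_bounded` are illustrative, and the default bound area(V)/k corresponds to the even-division case above.

```python
def cut_weight(edges, part_of):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v), w in edges.items() if part_of[u] != part_of[v])

def is_bounded(area, part_of, k, upper_bound=None):
    """Check area(V_i) <= UB for every partition; UB defaults to area(V) / k."""
    bound = sum(area.values()) / k if upper_bound is None else upper_bound
    sizes = {}
    for node, part in part_of.items():
        sizes[part] = sizes.get(part, 0) + area[node]
    return all(size <= bound for size in sizes.values())

# Hypothetical example: edge -> weight, node -> partition, node -> area.
edges = {("a", "b"): 1, ("b", "c"): 2, ("c", "d"): 3, ("d", "e"): 1, ("a", "f"): 2}
part_of = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}
unit_area = {node: 1 for node in part_of}

print(cut_weight(edges, part_of))            # 5: edges (c,d) and (a,f) are cut
print(is_bounded(unit_area, part_of, k=2))   # True: both partitions have area 3
```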



² For convenience, hyperedges may be referred to as edges. However, graph edges are formally 
defined as node pairs. 

For the special case where all nodes have unit area, the balance criterion is 

$|V_i| \le \left\lceil \frac{|V|}{k} \right\rceil$

2.4 Partitioning Algorithms 



Circuit partitioning, like many other combinatorial optimization problems discussed 
in this book, is NP-hard. That is, for all known exact methods, the effort needed to 
find an optimal solution grows faster than any polynomial function of the problem 
size. To date, there is no known polynomial-time, globally optimal algorithm for 
balance-constrained partitioning (Sec. 1.6). However, several efficient heuristics were 
developed in the 1970s and 1980s. These algorithms find high-quality circuit partitioning 
solutions and in practice are implemented to run in low-order polynomial time - the 
Kernighan-Lin (KL) algorithm (Sec. 2.4.1), its extensions (Sec. 2.4.2) and the 
Fiduccia-Mattheyses (FM) algorithm (Sec. 2.4.3). Additionally, optimization by 
simulated annealing can be used to solve particularly difficult partitioning formulations. 
In general, stochastic hill-climbing algorithms require more than polynomial time to 
produce high-quality solutions, but can be accelerated by sacrificing solution quality. 
In practice, simulated annealing is rarely competitive. 

2.4.1 Kernighan-Lin (KL) Algorithm 

The Kernighan-Lin (KL) algorithm performs partitioning through 
iterative-improvement steps. It was proposed by B. W. Kernighan and S. Lin in 1970 [2.6] for 
bipartitioning (k = 2) graphs, where every node has unit weight. This algorithm has 
been extended to support k-way partitioning (k > 2) as well as cells with arbitrary 
areas (Sec. 2.4.2). 

Introduction. The KL algorithm operates on a graph representation of the circuit, 
where nodes (edges) represent cells (connections between cells). Formally, let a 
graph G(V,E) have |V| = 2n nodes, where each node v ∈ V has the same weight, and 
each edge e ∈ E has a non-negative edge weight. The KL algorithm partitions V into 
two disjoint subsets A and B with minimum cut cost and |A| = |B| = n. 

The KL algorithm is based on exchanging (swapping) pairs of nodes, each node 
from a different partition. The two nodes that generate the highest reduction in cut 
size are swapped. To prevent immediate move reversal (undo) and subsequent 
infinite loops, the KL algorithm fixes nodes after swapping them. Fixed nodes cannot be 
swapped until they are released, i.e., become free. 

Execution of the KL algorithm proceeds in passes. Typically, the first pass or 
iteration begins with an arbitrary initial partition. In a given pass, after all nodes become 
fixed, the algorithm determines the prefix of the sequence of swaps within this pass 
that produces the largest gain, i.e., reduction of cut cost. All nodes included in this 
sequence are moved accordingly. The pass finishes by releasing all fixed nodes, so 
that all nodes are once again free. In each subsequent pass, the algorithm starts with 
the two partitions from the previous pass. All possible pair swaps are then re- 
evaluated. If no improvement is found during a given pass, the algorithm terminates. 

Terminology. The following terms are specifically relevant to the KL algorithm. 

The cut size or cut cost of a graph with either unweighted or uniform-weight edges 
is the number of edges that have nodes in more than one partition. With weighted 
edges, the cut cost is the sum of the weights of all cut edges. 

The cost D(v) of moving a node v ∈ V in a graph from partition A to B is 

D(v) = |E_c(v)| - |E_nc(v)| 

where E_c(v) is the set of v's incident edges that are cut by the cut line, and E_nc(v) is 
the set of v's incident edges that are not cut by the cut line. High costs (D > 0) 
indicate that the node should move, while low costs (D < 0) indicate that the node 
should stay within the same partition. 

The gain Δg(a,b) of swapping a pair of nodes a and b is the improvement in overall 
cut cost that would result from the node swap. A positive gain (Δg > 0) means that 
the cut cost is decreased, while a negative gain (Δg < 0) means that the cut cost is 
increased. The gain of swapping two nodes a and b is 

Δg(a,b) = D(a) + D(b) - 2c(a,b) 

where D(a) and D(b) are the respective costs of nodes a and b, and c(a,b) is the 
connection weight between a and b. If an edge exists between a and b, then c(a,b) = 
the edge weight between a and b. Otherwise, c(a,b) = 0. 

Notice that simply adding D(a) and D(b) when calculating Δg assumes that an edge 
is cut (uncut) before the swap and will be uncut (cut) after the swap. However, this 
does not apply if the nodes are connected by an edge e, as it will be cut both before 
and after the swap. Therefore, the term 2c(a,b) corrects for this overestimation of 
gain from the swap. 
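The cost and gain definitions can be checked on a small instance. The four-node graph below is hypothetical, and the sketch generalizes D(v) to weighted edges by summing edge weights instead of counting edges; with unit weights it reduces to the definition above.

```python
def cost_D(v, adj, part_of):
    """D(v): weight of v's cut edges minus weight of v's uncut edges."""
    cut = sum(w for u, w in adj[v].items() if part_of[u] != part_of[v])
    uncut = sum(w for u, w in adj[v].items() if part_of[u] == part_of[v])
    return cut - uncut

def swap_gain(a, b, adj, part_of):
    """Gain of swapping a and b: D(a) + D(b) - 2c(a,b)."""
    return cost_D(a, adj, part_of) + cost_D(b, adj, part_of) - 2 * adj[a].get(b, 0)

# Hypothetical graph with unit-weight edges a-c, a-d, b-c and c-d.
adj = {
    "a": {"c": 1, "d": 1},
    "b": {"c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"a": 1, "c": 1},
}
part_of = {"a": "A", "b": "A", "c": "B", "d": "B"}  # initial cut cost = 3

print([cost_D(v, adj, part_of) for v in "abcd"])  # [2, 1, 1, 0]
print(swap_gain("a", "c", adj, part_of))          # 1: a-c stays cut, so 2 + 1 - 2
print(swap_gain("b", "d", adj, part_of))          # 1: b and d are not connected
```

Swapping a and c leaves the edges a-c and c-d cut, so the cut cost drops from 3 to 2, which agrees with the computed gain of 1.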

The maximum positive gain G_m corresponds to the best prefix of m swaps within the 
swap sequence of a given pass. These m swaps lead to the partition with the 
minimum cut cost encountered during the pass. G_m is computed as the sum of Δg values 
over the first m swaps of the pass, with m chosen such that G_m is maximized. 

$G_m = \sum_{i=1}^{m} \Delta g_i$

Within a pass of the KL algorithm, the moves are only used to find the move 
sequence <1 ... m> and G_m. The moves are then applied after these have been found. 

Algorithm. The Kernighan-Lin algorithm progresses as follows. 

Kernighan-Lin Algorithm 
Input: graph G(V,E) with |V| = 2n 
Output: partitioned graph G(V,E) 

1.  (A,B) = INITIAL_PARTITION(G)                 // arbitrary initial partition 
2.  G_m = ∞ 
3.  while (G_m > 0) 
4.    i = 1 
5.    order = Ø 
6.    foreach (node v ∈ V) 
7.      status[v] = FREE                          // set every node as free 
8.      D[v] = COST(v)                            // compute D(v) for each node 
9.    while (!IS_FIXED(V))                        // while all cells are not fixed, select free 
10.     (Δg_i,(a_i,b_i)) = MAX_GAIN(A,B)          // node a_i from A and free node b_i from 
                                                  // B so that Δg_i = D(a_i) + D(b_i) - 2c(a_i,b_i) 
                                                  // is maximized 
11.     ADD(order,(Δg_i,(a_i,b_i)))               // keep track of swapped cells 
12.     TRY_SWAP(a_i,b_i,A,B)                     // move a_i to B, and move b_i to A 
13.     status[a_i] = FIXED                       // mark a_i as fixed 
14.     status[b_i] = FIXED                       // mark b_i as fixed 
15.     foreach (free node v_f connected to a_i and b_i) 
16.       D[v_f] = COST(v_f)                      // compute and update D(v_f) 
17.     i = i + 1 
18.   (G_m,m) = BEST_MOVES(order)                 // swap sequence 1 ... m that 
                                                  // maximizes G_m 
19.   if (G_m > 0) 
20.     CONFIRM_MOVES(order,m)                    // execute move sequence 

First, partition the input graph G into two arbitrary partitions A and B (line 1), and 
set the maximum positive gain value G_m to ∞ (line 2). During each pass (line 3), for 
each node v ∈ V, compute the cost D(v) and set v to free (lines 4-8). Then, while all 
nodes are not fixed (line 9), select, swap, and fix two free nodes a_i and b_i from A and 
B, respectively, such that Δg_i is maximized (lines 10-14). Update D(v) for all free 
nodes that are adjacent to a_i and b_i (lines 15-16). 

After all nodes have been fixed, find the move sequence <1 ... m> that maximizes 
G_m (line 18). If G_m is positive, execute the move sequence (lines 19-20), and perform 
another pass (lines 3-20). Otherwise, terminate the algorithm. 
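The passes described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a symmetric adjacency dictionary, the example graph is hypothetical, and instead of calling COST again the sketch updates D with the standard incremental rule D'(x) = D(x) + 2c(x,a) - 2c(x,b) for free x in A (and symmetrically for B), which yields the same values.

```python
def build_adj(edges):
    """Symmetric adjacency dictionary from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def cut_cost(adj, A):
    """Total weight of edges with exactly one endpoint in A."""
    return sum(w for u in A for v, w in adj[u].items() if v not in A)

def kl_bipartition(adj, A, B):
    """Run KL passes until a pass yields no positive gain G_m."""
    A, B = set(A), set(B)
    while True:
        side = {v: 0 for v in A}
        side.update({v: 1 for v in B})
        # D(v) = (weight of cut edges) - (weight of uncut edges) at v.
        D = {v: sum(w if side[u] != side[v] else -w for u, w in adj[v].items())
             for v in side}
        free_a, free_b = set(A), set(B)
        order = []
        while free_a and free_b:
            # Pick the free pair with maximum gain D(a) + D(b) - 2c(a,b).
            gain, a, b = max((D[a] + D[b] - 2 * adj[a].get(b, 0), a, b)
                             for a in free_a for b in free_b)
            order.append((gain, a, b))
            free_a.remove(a)  # a and b are now fixed for this pass
            free_b.remove(b)
            # Update D of the remaining free nodes as if a and b were swapped.
            for x in free_a:
                D[x] += 2 * adj[x].get(a, 0) - 2 * adj[x].get(b, 0)
            for y in free_b:
                D[y] += 2 * adj[y].get(b, 0) - 2 * adj[y].get(a, 0)
        # Best prefix of the swap sequence: maximize the running sum of gains.
        best_gain, best_m, running = 0, 0, 0
        for m, (gain, _, _) in enumerate(order, start=1):
            running += gain
            if running > best_gain:
                best_gain, best_m = running, m
        if best_gain <= 0:
            return A, B
        for _, a, b in order[:best_m]:  # execute the first m swaps
            A.remove(a); B.add(a)
            B.remove(b); A.add(b)

# Hypothetical graph: triangles a-b-c and d-e-f joined by the edge c-d.
adj = build_adj([("a", "b", 1), ("a", "c", 1), ("b", "c", 1),
                 ("d", "e", 1), ("d", "f", 1), ("e", "f", 1), ("c", "d", 1)])
result_A, result_B = kl_bipartition(adj, {"a", "b", "d"}, {"c", "e", "f"})
print(sorted(result_A), sorted(result_B), cut_cost(adj, result_A))
# ['a', 'b', 'c'] ['d', 'e', 'f'] 1
```

In this example the first pass swaps c and d (gain 4, reducing the cut cost from 5 to 1), and the second pass finds no positive prefix, so the algorithm stops.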






Example: KL Algorithm 

Given: initial partition of nodes a-h (right). 

Task: perform the first pass of the KL algorithm. 

Solution: 

Initial cut cost = 9. 

Compute D(v) costs for all free nodes a-h. 
D(a) = 1, D(b) = 1, D(c) = 2, D(d) = 1, 
D(e) = 1, D(f) = 2, D(g) = 1, D(h) = 1 
Δg1 = D(c) + D(e) - 2c(c,e) = 2 + 1 - 0 = 3 
Swap and fix nodes c and e. 

Update D(v) costs for all free nodes connected to 
newly swapped nodes c and e: a, b, d, f, g and h. 
D(a) = -1, D(b) = -1, D(d) = 3, 
D(f) = 2, D(g) = -1, D(h) = -1 
Δg2 = D(d) + D(f) - 2c(d,f) = 3 + 2 - 0 = 5 
Swap and fix nodes d and f. 

Update D(v) costs for all free nodes connected to 
newly swapped nodes d and f: a, b, g and h. 
D(a) = -3, D(b) = -3, D(g) = -3, D(h) = -3 
Δg3 = D(a) + D(g) - 2c(a,g) = -3 + -3 - 0 = -6 
Swap and fix nodes a and g. 

Update D(v) costs for all free nodes connected to 
newly swapped nodes a and g: b and h. 
D(b) = -1, D(h) = -1 
Δg4 = D(b) + D(h) - 2c(b,h) = -1 + -1 - 0 = -2 
Swap and fix nodes b and h. 

Compute maximum positive gain G_m. 
G1 = Δg1 = 3 
G2 = Δg1 + Δg2 = 3 + 5 = 8 
G3 = Δg1 + Δg2 + Δg3 = 3 + 5 + -6 = 2 
G4 = Δg1 + Δg2 + Δg3 + Δg4 = 3 + 5 + -6 + -2 = 0 

G_m = 8 with m = 2. Since G_m > 0, the first m = 2 swaps are executed: (c,e) and (d,f). 
Additional passes are performed until G_m ≤ 0. 



The KL algorithm does not always find an optimal solution, but performs reasonably 
well in practice on graphs with up to several hundred nodes. In general, the number 
of passes grows with the size of the graph, but in practice improvement often ceases 
after four passes. 

Notice that Δg cannot always be positive: after all nodes have been swapped 
between two partitions, the cut cost will be exactly the same as the initial cut cost, so 
some best-gain values during the pass can be negative. However, since other moves 
(gains) might compensate for this, the entire pass should be completed, computing 
all moves until all cells are fixed. 

The runtime of the KL algorithm is dominated by two factors - gain updates and 
pair selection. The KL algorithm selects n pairs of nodes to swap, where n is the 
number of nodes in each partition. For each node v, the required time to update the 
gains and compare is on the order of O(n). That is, after swapping a_i and b_i in move i, 
at most (2n - 2i) gains of free nodes must be updated. Therefore, the time spent 
updating gains over the n moves in a pass is at most 

$\sum_{i=1}^{n} (2n - 2i) = O(n^2)$

During pair comparison in a given move i, there are as many as (n - i + 1)^2 = O(n^2) 
pairs to choose from. The time to perform n pair comparisons is bounded by 

$\sum_{i=1}^{n} (n - i + 1)^2 = O(n^3)$

Therefore, the KL algorithm requires a total of O(n^2) + O(n^3) = O(n^3) time. 

An optimized KL implementation has O(n^2 log n) runtime complexity. To speed up 
pair comparison, node pairs can be sorted ahead of time. Since the goal is to 
maximize Δg(a,b) = D(a) + D(b) - 2c(a,b), the gains of each node move can be sorted in 
descending order. That is, for each node a ∈ A, order the gains D(a) such that 

$D(a_1) \ge D(a_2) \ge \ldots \ge D(a_{n-i+1})$

Similarly, for each node b ∈ B, order the gains D(b) such that 

$D(b_1) \ge D(b_2) \ge \ldots \ge D(b_{n-i+1})$

Then, evaluate pairwise gains, starting with the first elements from both lists. A 
clever order of evaluation - exploiting advanced data structures and bounded node 
degrees [2.3] - allows the pair evaluation process to stop once a pair of gains D(a_j) 
and D(b_k) is found for which D(a_j) + D(b_k) is less than the best previously-found gain 
(no better pair-swap can exist). In practice, the best pair-swap at the k-th move can be 
found in O(n - k) time after sorting the free node gains in O((n - k) log (n - k)) time 
[2.2]. The time required to perform pair comparison is thus reduced from O(n^2) time 
to O(n log n) time. 
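The early-termination idea can be sketched as follows. This is a simplified illustration that only uses the two sorted lists and the bound D(a) + D(b) ≥ Δg(a,b), which holds because edge weights are non-negative; it does not reproduce the data structures of [2.2] or [2.3], and the function name and example values are assumptions.

```python
def best_swap(D, free_a, free_b, c):
    """Find a maximum-gain pair, pruning with the bound gain <= D[a] + D[b]."""
    a_sorted = sorted(free_a, key=lambda v: D[v], reverse=True)
    b_sorted = sorted(free_b, key=lambda v: D[v], reverse=True)
    best = None  # (gain, a, b)
    for a in a_sorted:
        # If even the largest D in B cannot beat the best gain, stop entirely.
        if best is not None and D[a] + D[b_sorted[0]] <= best[0]:
            break
        for b in b_sorted:
            # Later b have smaller D(b), so none of them can improve on best.
            if best is not None and D[a] + D[b] <= best[0]:
                break
            gain = D[a] + D[b] - 2 * c(a, b)
            if best is None or gain > best[0]:
                best = (gain, a, b)
    return best

# Hypothetical costs and connection weights for free nodes a, b (in A), c, d (in B).
D = {"a": 2, "b": 1, "c": 1, "d": 0}
weights = {("a", "c"): 1, ("a", "d"): 1, ("b", "c"): 1}

def conn(u, v):
    """Connection weight c(u,v); 0 if the nodes are not connected."""
    return weights.get((u, v), weights.get((v, u), 0))

best = best_swap(D, {"a", "b"}, {"c", "d"}, conn)
print(best[0])  # 1
```

The pruning never discards a strictly better pair, so the returned gain equals the one found by exhaustive comparison; how much work is saved depends on the gain distribution.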

2.4.2 Extensions of the Kernighan-Lin Algorithm 

To accommodate unequal partition sizes |A| ≠ |B|, arbitrarily split the nodes among 
the two partitions A and B, where one partition contains min(|A|,|B|) nodes and the 
other max(|A|,|B|) nodes. Apply the KL algorithm with the restriction that only 
min(|A|,|B|) node pairs can be swapped. 

To accommodate unequal cell sizes or unequal node weights, assign a unit area that 
denotes the smallest cell area, i.e., the greatest common divisor of all cell areas. All 
unequal node sizes are then cast as integer multiples of the unit area. Each node 
portion (all parts of an original node that was split up) is connected to each of its 
counterparts by infinite-weight, i.e., high-priority edges. Apply the KL algorithm. 

To perform k-way partitioning, arbitrarily assign all k · n nodes to partitions such that 
each partition has n nodes. Apply the KL 2-way partitioning algorithm to all 
possible pairs of subsets (1 and 2, 2 and 3, etc.) until none of the consecutive KL 
applications obtains any improvement on the cut size. 

2.4.3 Fiduccia-Mattheyses (FM) Algorithm 

Given a graph G(V,E) with nodes and weighted edges, the goal of (bi)partitioning is 
to assign all nodes to disjoint partitions, so as to minimize the total cost (weight) of 
all cut nets while satisfying partition size constraints. The Fiduccia-Mattheyses (FM) 
algorithm, a partitioning heuristic published in 1982 by C. M. Fiduccia and R. M. 
Mattheyses [2.4], offers substantial improvements over the KL algorithm. 

— Single cells are moved independently instead of swapping pairs of cells. Thus, 
this algorithm is more naturally applicable to partitions of unequal size or the 
presence of initially fixed cells. 

— Cut costs are extended to include hypergraphs (Sec. 1.8). Thus, all nets with 
two or more pins can be considered. While the KL algorithm aims to minimize 
cut costs based on edges, the FM algorithm minimizes cut costs based on nets. 

— The area of each individual cell is taken into account. 

— The selection of cells to move is much faster. The FM algorithm has runtime 
complexity of O(|Pins|) per pass, where |Pins| is the total number of pins, 
defined as the sum of all edge degrees |e| over all edges e ∈ E. 

Introduction. The FM algorithm is typically applied to large circuit netlists. For this 
section, all nodes and subgraphs are referred to as cells and blocks, respectively. 

The FM move selection process is similar to that of the KL algorithm, with the un- 
derlying objective being to minimize cut cost. However, the FM algorithm computes 
the gain of each individual cell move, rather than that of each pair-swap. Like the 
KL algorithm, the FM algorithm selects the best prefix of moves from within a pass. 

During an FM pass, once a cell is moved, it becomes fixed and cannot be moved for 
the remainder of the pass. The cells that are moved during the FM algorithm are 
denoted by the sequence <c_1 ... c_m>, whereas the KL algorithm swaps the first m 
pairs. 

Terminology. The following definitions are relevant to the FM algorithm. 

A net is cut if its cells occupy more than one partition. Otherwise, the net is uncut. 

The cut set of a partition part is the set of all nets that are marked as cut within part. 

The gain Δg(c) for cell c is the change in the cut set size if c moves. The higher the 
gain Δg(c), the higher is the priority to move the cell c to the other partition. 
Formally, the cell gain is defined as 

Δg(c) = FS(c) - TE(c) 

where FS(c) is the number of nets connected to c but not connected to any other 
cells within c's partition, i.e., cut nets that connect only to c, and TE(c) is the number 
of uncut nets connected to c. Informally, FS(c) is like a moving force - the higher 
the FS(c) value, the stronger the pull to move c to the other partition. TE(c) is like a 
retention force - the higher the TE(c) value, the stronger the desire to remain in the 
current partition. 
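The cell gain can be computed by counting, for each net on a cell, how many of the net's cells share the cell's partition. The netlist below is hypothetical, and the names `fm_gain` and `cut_size` are illustrative; the sketch assumes unit net weights and nets with at least two pins.

```python
def fm_gain(cell, nets, part_of):
    """Cell gain FS(cell) - TE(cell) for unit-weight nets."""
    fs = te = 0
    for members in nets.values():
        if cell not in members or len(members) < 2:
            continue
        same = sum(1 for c in members if part_of[c] == part_of[cell])
        other = len(members) - same
        if same == 1 and other > 0:
            fs += 1  # cut net whose only pin on this side is the cell itself
        if other == 0:
            te += 1  # uncut net that would become cut if the cell moved
    return fs - te

def cut_size(nets, part_of):
    """Number of nets whose cells occupy more than one partition."""
    return sum(1 for m in nets.values() if len({part_of[c] for c in m}) > 1)

# Hypothetical netlist: cells a, b in partition A and c, d, e in partition B.
nets = {"N1": ["a", "c"], "N2": ["a", "b", "d"], "N3": ["c", "d", "e"], "N4": ["b", "c"]}
part_of = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "B"}

gains = {cell: fm_gain(cell, nets, part_of) for cell in part_of}
print(gains)                    # {'a': 1, 'b': 1, 'c': 1, 'd': 0, 'e': -1}
print(cut_size(nets, part_of))  # 3
```

For instance, moving cell c to partition A uncuts N1 and N4 but cuts N3, a net reduction of one cut net, which matches Δg(c) = 2 - 1 = 1.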

The maximum positive gain G_m of a pass is the cumulative cell gain of m moves that 
produce a minimum cut cost. G_m is determined by the maximum sum of cell gains 
Δg over a prefix of m moves in a pass 

$G_m = \sum_{i=1}^{m} \Delta g_i$

As in the KL algorithm, all moves in a pass are used to determine G_m and the move 
sequence <c_1 ... c_m>. Only at the end of the pass, i.e., after determining G_m and the 
corresponding m moves, are the cell positions updated (moved). 

The ratio factor is the relative balance between the two partitions with respect to cell 
area. This ratio factor is used to prevent all cells from clustering into one partition. 
The ratio factor r is defined as 

$r = \frac{area(A)}{area(A) + area(B)}$

where area(A) and area(B) are the total respective areas of partitions A and B, and 

area(A) + area(B) = area(V) 

where area(V) is the total area of all cells c ∈ V, and is defined as 

$area(V) = \sum_{c \in V} area(c)$

The balance criterion enforces the ratio factor. To ensure feasibility, the maximum
cell area area_max(V) must be taken into account. A partitioning of V into two partitions
A and B is said to be balanced if

$$r \cdot \mathrm{area}(V) - \mathrm{area}_{max}(V) \le \mathrm{area}(A) \le r \cdot \mathrm{area}(V) + \mathrm{area}_{max}(V)$$

A base cell is a cell c that has maximum cell gain Δg(c) among all free cells, and
whose move does not violate the balance criterion.
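
As a small illustration, the balance bounds can be computed as follows. The function names are hypothetical, the cell areas and r are those of the worked example later in this section, and the inequalities are assumed to be non-strict, which agrees with that example (area(A) = 11 is accepted there).

```python
def balance_bounds(areas, r):
    """Lower and upper bound on area(A) for ratio factor r."""
    total = sum(areas.values())      # area(V)
    largest = max(areas.values())    # area_max(V)
    return r * total - largest, r * total + largest

def is_balanced(area_a, bounds):
    """True if area(A) lies within the (inclusive) bounds."""
    lb, ub = bounds
    return lb <= area_a <= ub

areas = {"a": 2, "b": 4, "c": 1, "d": 4, "e": 5}
bounds = balance_bounds(areas, 0.375)
print(bounds, is_balanced(4, bounds), is_balanced(0, bounds))
```

Widening the range by the largest cell area guarantees that at least one cell can always be moved without leaving the feasible range.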

The pin distribution of a net net is given as a pair (A(net),B(net)), where A(net) is the
number of pins in partition A and B(net) is the number of pins in partition B.

A net net is critical if it contains a cell c whose move changes the cut state of net. A
critical net is either contained completely within a partition, or has exactly one of its
cells in one partition and all of its other cells in the other partition. If net is critical,
then either A(net) = 0, A(net) = 1, B(net) = 0, or B(net) = 1 must hold (Fig. 2.4).



Fig. 2.4 Cases when a net net is critical. (a) A(net) = 0. (b) A(net) = 1. (c) B(net) = 0. (d) B(net) = 1.
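
The four cases can be checked directly from the pin distribution. The sketch below is illustrative only: the function names, the representation of a net as a tuple of cell names, and the sample partition are assumptions.

```python
def pin_distribution(net, side):
    """Return (A(net), B(net)) for a net given as a tuple of cell names."""
    a = sum(1 for cell in net if side[cell] == "A")
    return a, len(net) - a

def is_critical(net, side):
    """A net is critical if A(net) or B(net) equals 0 or 1."""
    a, b = pin_distribution(net, side)
    return a in (0, 1) or b in (0, 1)

# Hypothetical partition of four cells
side = {"a": "A", "b": "A", "c": "B", "d": "B"}
print(is_critical(("a", "b", "c"), side))       # B(net) = 1
print(is_critical(("a", "b", "c", "d"), side))  # A(net) = B(net) = 2
```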

Critical nets simplify the calculation of cell gains. Only cells belonging to critical
nets need to be considered in the gain computation, as it is only for such nets that the
movement of a single cell can change the cut state. B. Krishnamurthy [2.8] generalized
the concept of criticality and improved the FM algorithm such that it comprehends
how many cell moves nets are away from being critical. This results in a gain
vector for each cell instead of a single gain value: the i-th element of the gain vector
for a free cell c_f records how many nets will become i cell moves away from being
uncut if c_f moves.

The from-block F and to-block T define the direction in which a cell moves. That is, 
a cell moves from F to T. 
Algorithm. The Fiduccia-Mattheyses algorithm progresses as follows.

Fiduccia-Mattheyses Algorithm
Input: graph G(V,E), ratio factor r
Output: partitioned graph G(V,E)

1.  (lb,ub) = BALANCE_CRITERION(G,r)          // compute balance criterion
2.  (A,B) = PARTITION(G)                      // initial partition
3.  G_m = ∞
4.  while (G_m > 0)
5.    i = 1
6.    order = Ø
7.    foreach (cell c ∈ V)                    // for each cell, compute the
8.      Δg[i][c] = FS(c) - TE(c)              // gain for current iteration,
9.      status[c] = FREE                      // and set each cell as free
10.   while (!IS_FIXED(V))                    // while there are free cells, find
11.     cell = MAX_GAIN(Δg[i],lb,ub)          // the cell with maximum gain
12.     ADD(order,(cell,Δg[i]))               // keep track of cells moved
13.     critical_nets = CRITICAL_NETS(cell)   // critical nets connected to cell
14.     if (cell ∈ A)                         // if cell belongs to partition A,
15.       TRY_MOVE(cell,A,B)                  // move cell from A to B
16.     else                                  // otherwise, if cell belongs to B,
17.       TRY_MOVE(cell,B,A)                  // move cell from B to A
18.     status[cell] = FIXED                  // mark cell as fixed
19.     foreach (net net ∈ critical_nets)     // update gains for critical cells
20.       foreach (cell c ∈ net, c ≠ cell)
21.         if (status[c] == FREE)
22.           UPDATE_GAIN(Δg[i][c])
23.     i = i + 1
24.   (G_m,m) = BEST_MOVES(order)             // move sequence c_1 ... c_m that
                                              // maximizes G_m
25.   if (G_m > 0)
26.     CONFIRM_MOVES(order,m)                // execute move sequence



First, compute the balance criterion (line 1) and partition the graph (line 2). During
each pass (line 4), determine the gain for each free cell c based on the number of
incident cut nets FS(c) and incident uncut nets TE(c) (lines 7-8). After all the cell
gains have been determined, select the base cell that maximizes gain subject to satisfying
the balance criterion (line 11). If multiple cells meet these two criteria, then break
ties, e.g., to keep partition sizes as close as possible to the middle of the range given
by the balance criterion.

After the base cell has been selected, move it from its current partition to the other
partition, and mark it as fixed (lines 14-18). Of the remaining free cells, only those
connected to the base cell by critical nets need to have their gains updated (lines 19-22).
The gain of an unmoved cell can only change if it is connected to a moved cell
(a base cell). Therefore, gain values change only when a net goes from occupying
two partitions to one (positive gain) or vice-versa (negative gain). After updating the
necessary cell gains, select the next base cell that has maximum gain value and
satisfies the balance criterion. Continue this process until there are no remaining free
cells (lines 10-23). Once all cells have been fixed, find the best prefix of the move
sequence <c_1 ... c_m> (line 24) to achieve maximum positive gain

$$G_m = \sum_{i=1}^{m} \Delta g_i$$

As long as G_m > 0, execute the move sequence, free all cells, and begin a new pass.
Otherwise, terminate the algorithm (lines 25-26).

Example: FM Algorithm

Given: (1) weighted cells a-e, (2) ratio factor r = 0.375, (3) nets N_1-N_5,
and (4) initial partition A = {a,b}, B = {c,d,e}.

area(a) = 2    area(b) = 4    area(c) = 1    area(d) = 4    area(e) = 5

N_1 = (a,b)    N_2 = (a,b,c)    N_3 = (a,d)    N_4 = (a,e)    N_5 = (c,d)

Task: perform the first pass of the FM algorithm.

Solution:

Compute the balance criterion.

$$r \cdot \mathrm{area}(V) - \mathrm{area}_{max}(V) \le \mathrm{area}(A) \le r \cdot \mathrm{area}(V) + \mathrm{area}_{max}(V)$$

r · area(V) - area_max(V) = 0.375 · 16 - 5 = 1, r · area(V) + area_max(V) = 0.375 · 16 + 5 = 11.

Size range of A: 1 ≤ area(A) ≤ 11.




Compute the gains of each cell a-e.

Nets N_3 and N_4 are cut: FS(a) = 2. Net N_1 is connected to a but is not cut: TE(a) = 1.
Δg_1(a) = 2 - 1 = 1. The cut size will be reduced if a moves from A to B.
Similarly, the gains of the remaining cells are computed.

b:   FS(b) = 0    TE(b) = 1    Δg_1(b) = -1
c:   FS(c) = 1    TE(c) = 1    Δg_1(c) = 0
d:   FS(d) = 1    TE(d) = 1    Δg_1(d) = 0
e:   FS(e) = 1    TE(e) = 0    Δg_1(e) = 1
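
These initial gains can be reproduced with a short Python sketch. The function name cell_gain and the data layout are assumptions; the initial partition A = {a,b}, B = {c,d,e} is inferred from the area computations of this example (moving a leaves area(A) = area(b) = 4).

```python
def cell_gain(cell, nets, side):
    """Return (FS, TE, gain) for `cell` under the two-way partition `side`."""
    fs = te = 0
    for net in nets:
        if cell not in net or len(net) < 2:
            continue                      # single-pin nets can never be cut
        same = sum(1 for c in net if side[c] == side[cell])
        if same == 1:
            fs += 1                       # cell is alone on its side of a cut net
        elif same == len(net):
            te += 1                       # net lies entirely on the cell's side
    return fs, te, fs - te

nets = [("a", "b"), ("a", "b", "c"), ("a", "d"), ("a", "e"), ("c", "d")]
side = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "B"}   # inferred partition
for cell in "abcde":
    print(cell, cell_gain(cell, nets, side))
```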



Select the base cell. Possible base cells are a and e. 

Balance criterion after moving a: area(A) = area(b) = 4. 

Balance criterion after moving e: area(A) = area(a) + area(b) + area(e) =11. 

Both moves respect the balance criterion, but a is selected, moved, and fixed as a result of the 
tie-breaking criterion described above. 

Update gains for all free cells that are connected by critical nets to a. To determine whether a
given net is critical, the number of cells associated with that net in each partition is counted
before and after the move.

For a given net net, let the number of cells in the from-block and to-block before the move be
F(net) and T(net), respectively. Let the number of cells in the from-block and to-block after
the move be F'(net) and T'(net), respectively. If any of these values is 0 or 1, then net is critical.
For nets N_1, N_2, N_3, N_4, T(N_1) = 0 and T(N_2) = T(N_3) = T(N_4) = 1. Therefore, cells b, c, d
and e need updating. The gain values do not have to be computed explicitly but can be derived
from T(net).

If T(net) = 0, all gain values of free cells connected to net increase by 1. Since T(N_1) = 0, cell b
has updated gain value of Δg_1(b) = Δg_1(b) + 1. That is, net N_1 (connected to cell b) has increased
the cut set of the partition. The increase in Δg(b) reflects the motivation to move cell b.
Since net N_1 is now cut, moving cell b is justified.

If T(net) = 1, all gain values of free cells connected to net decrease by 1. Since T(N_2) = T(N_3)
= T(N_4) = 1, cells c, d and e have updated gain values of Δg_1(c,d,e) = Δg_1(c,d,e) - 1. That is,
nets N_2, N_3 and N_4 (connected to cells c, d and e, respectively) have decreased the cut set of
the partition. This reduction in Δg_1(c,d,e) reflects the motivation to not move cells c, d and e.
Similarly, when F'(net) = 0, all cell gains connected to net are reduced by 1, and when F'(net)
= 1, all cell gains connected to net are increased by 1.

The updated Δg values are

FS(b) = 2    TE(b) = 0    Δg_1(b) = 2
FS(c) = 0    TE(c) = 1    Δg_1(c) = -1
FS(d) = 0    TE(d) = 2    Δg_1(d) = -2
FS(e) = 0    TE(e) = 1    Δg_1(e) = -1
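
The update can be sketched in Python as follows. This sketch uses the update rule as it is commonly stated for FM, which is slightly more specific than the summary above: when T(net) = 1 only the single to-block cell is decremented, and when F'(net) = 1 only the single remaining from-block cell is incremented. The function name and data layout are assumptions, and the initial partition A = {a,b}, B = {c,d,e} is inferred from the example.

```python
def move_and_update(base, nets, side, gain, free):
    """Move `base` to the other block and update the gains of free cells in place."""
    src = side[base]
    dst = "B" if src == "A" else "A"
    free.discard(base)                         # the base cell becomes fixed
    for net in nets:
        if base not in net:
            continue
        others = [c for c in net if c != base]
        to_cells = [c for c in others if side[c] == dst]      # T(net) before the move
        from_cells = [c for c in others if side[c] == src]    # F'(net) after the move
        if len(to_cells) == 0:                 # net was uncut and becomes cut
            for c in others:
                if c in free:
                    gain[c] += 1
        elif len(to_cells) == 1 and to_cells[0] in free:
            gain[to_cells[0]] -= 1             # that cell is no longer alone
        if len(from_cells) == 0:               # net is now entirely in the to-block
            for c in others:
                if c in free:
                    gain[c] -= 1
        elif len(from_cells) == 1 and from_cells[0] in free:
            gain[from_cells[0]] += 1           # that cell is now alone
    side[base] = dst

nets = [("a", "b"), ("a", "b", "c"), ("a", "d"), ("a", "e"), ("c", "d")]
side = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "B"}   # inferred partition
gain = {"a": 1, "b": -1, "c": 0, "d": 0, "e": 1}            # initial gains
free = set(side)
move_and_update("a", nets, side, gain, free)
print(gain)
```

Only the nets incident to the base cell are visited, which is what makes the incremental update cheaper than recomputing FS and TE for every free cell.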



Iteration i = 1

Partitions: A_1 = {b}, B_1 = {a,c,d,e}, with fixed cells {a}.

Iteration i = 2

Cell b has maximum gain Δg_2 = 2, area(A) = 0, balance criterion is violated.
Cell c has next maximum gain Δg_2 = -1, area(A) = 5, balance criterion is met.
Cell e has next maximum gain Δg_2 = -1, area(A) = 9, balance criterion is met.
Move cell c, updated partitions: A_2 = {b,c}, B_2 = {a,d,e}, with fixed cells {a,c}.

Iteration i = 3

Gain values: Δg_3(b) = 1, Δg_3(d) = 0, Δg_3(e) = -1.

Cell b has maximum gain Δg_3 = 1, area(A) = 1, balance criterion is met.

Move cell b, updated partitions: A_3 = {c}, B_3 = {a,b,d,e}, with fixed cells {a,b,c}.

Iteration i = 4

Gain values: Δg_4(d) = 0, Δg_4(e) = -1.

Cell d has maximum gain Δg_4 = 0, area(A) = 5, balance criterion is met.

Move cell d, updated partitions: A_4 = {c,d}, B_4 = {a,b,e}, with fixed cells {a,b,c,d}.

Iteration i = 5

Gain values: Δg_5(e) = -1.

Cell e has maximum gain Δg_5 = -1, area(A) = 10, balance criterion is met.

Move cell e, updated partitions: A_5 = {c,d,e}, B_5 = {a,b}, all cells fixed.

Find best move sequence <c_1 ... c_m>

G_1 = Δg_1 = 1
G_2 = Δg_1 + Δg_2 = 0
G_3 = Δg_1 + Δg_2 + Δg_3 = 1
G_4 = Δg_1 + Δg_2 + Δg_3 + Δg_4 = 1
G_5 = Δg_1 + Δg_2 + Δg_3 + Δg_4 + Δg_5 = 0

Maximum positive cumulative gain $G_m = \sum_{i=1}^{m} \Delta g_i = 1$ found in iterations 1, 3 and 4.

The move prefix m = 4 is selected due to the better balance ratio (area(A) = 5); the four cells a,
b, c and d are then moved.

Result of Pass 1: Current partitions: A = {c,d}, B = {a,b,e}, cut cost reduced from 3 to 2.

Pass 2 is left as a further exercise (Exercise 3). In this pass, the cut cost is reduced from 2 to 1.
In Pass 3, no further improvement can be made. Thus, the partition found in Pass 2 is the final
solution returned by the FM algorithm with a cut cost of 1.
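
The first pass worked through above can be reproduced with the following Python sketch. It is a simplified illustration and not the bucket-based implementation used in practice: gains are recomputed from scratch in every iteration instead of being updated incrementally, ties are broken by closeness of area(A) to the middle of the balance range, and all names, as well as the initial partition A = {a,b}, B = {c,d,e} inferred from the example, are assumptions.

```python
def gain(cell, nets, side):
    """Cell gain FS(cell) - TE(cell)."""
    g = 0
    for net in nets:
        if cell in net and len(net) > 1:
            same = sum(1 for c in net if side[c] == side[cell])
            if same == 1:
                g += 1                      # FS: cell alone on its side
            elif same == len(net):
                g -= 1                      # TE: net currently uncut
    return g

def fm_pass(nets, area, side, r):
    """One FM pass; returns (G_m, cells moved, resulting partition)."""
    total, largest = sum(area.values()), max(area.values())
    lb, ub = r * total - largest, r * total + largest
    mid = (lb + ub) / 2
    work = dict(side)                       # tentative partition during the pass
    area_a = sum(area[c] for c in work if work[c] == "A")
    free = set(work)
    history = []                            # (cell, gain, area(A) after the move)
    while free:
        candidates = []
        for c in sorted(free):
            new_a = area_a - area[c] if work[c] == "A" else area_a + area[c]
            if lb <= new_a <= ub:           # move keeps the partition balanced
                candidates.append((-gain(c, nets, work), abs(new_a - mid), c, new_a))
        if not candidates:
            break                           # no free cell can move legally
        neg_g, _, cell, area_a = min(candidates)
        work[cell] = "B" if work[cell] == "A" else "A"
        free.remove(cell)
        history.append((cell, -neg_g, area_a))
    best, running = None, 0                 # best prefix: max G_m, then best balance
    for m, (_, g, a) in enumerate(history, start=1):
        running += g
        key = (running, -abs(a - mid))
        if best is None or key > best[0]:
            best = (key, m, running)
    if best is None or best[2] <= 0:
        return 0, [], dict(side)            # no improving prefix: keep the partition
    _, m, g_m = best
    moved = [c for c, _, _ in history[:m]]
    new_side = dict(side)
    for c in moved:
        new_side[c] = "B" if new_side[c] == "A" else "A"
    return g_m, moved, new_side

nets = [("a", "b"), ("a", "b", "c"), ("a", "d"), ("a", "e"), ("c", "d")]
area = {"a": 2, "b": 4, "c": 1, "d": 4, "e": 5}
side = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "B"}   # inferred partition
g_m, moved, new_side = fm_pass(nets, area, side, 0.375)
print(g_m, moved, sorted(c for c in new_side if new_side[c] == "A"))
```

Calling fm_pass repeatedly on its own result until the returned G_m is no longer positive corresponds to the outer while loop of the algorithm.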
Among the partitioning techniques discussed so far, the Fiduccia-Mattheyses heuristic
offers the best tradeoff between solution quality and runtime. In particular, it is
much faster than other techniques and, in practice, finds better partitions given the
same amount of time. Unfortunately, if the partitioned hypergraph includes more
than several hundred nodes, the FM algorithm may terminate with a high net cut or
make a large number of passes, each producing minimal improvement.



To improve the scalability of netlist partitioning, the FM algorithm is typically embedded
into a multilevel framework that consists of several distinct steps. First, during
the coarsening phase, the original "flat" netlist is hierarchically clustered. Second,
FM is applied to the clustered netlist. Third, the netlist is partially unclustered
during the uncoarsening phase. Fourth, during the refinement phase, FM is applied
incrementally to the partially unclustered netlist. The third and fourth steps are
repeated until the netlist is fully unclustered. In other words, FM is applied to the partially
unclustered netlist and the solution is unclustered further, with this process
repeating until the solution is completely flat.

For circuits with hundreds of thousands of gates, the multilevel framework dramati- 
cally improves runtime because many of the FM calls operate on smaller netlists, 
and each incremental FM call has a relatively high-quality initial solution. Further- 
more, solution quality is improved, as applying FM to clustered netlists allows the 
algorithm to reassign entire clusters to different partitions where appropriate. 

► 2.5.1 Clustering 

To construct a coarsened netlist, groups of tightly-connected nodes can be clustered,
absorbing connections between these nodes (Fig. 2.5). The remaining connections
between clusters retain the overall structure of the original netlist. In specific applications,
the size of each cluster is often limited so as to prevent degenerate clustering,
where a single large cluster dominates other clusters.

When merging nodes, a cluster is assigned the sum of the weights of its constituent
nodes. As closed-form objective functions for clustering are difficult to formulate,
clustering is performed by application-specific algorithms. Additionally, clustering
must be performed quickly to ensure the scalability of multilevel partitioning.






Fig. 2.5 An initial graph (left), and possible clusterings of the graph (right).

► 2.5.2 Multilevel Partitioning 

Multilevel partitioning techniques begin with a coarsening phase in which the
input graph G is clustered into a smaller graph G', which, in turn, can also be
clustered into another graph G'', and so on. Let l be the number of levels, i.e.,
times, that G goes through the coarsening stage. Each node at level i represents a
cluster of nodes at level i + 1. For large-scale applications, the clustering ratio, i.e.,
the average number of nodes per cluster, is often 1.3 (Hypergraph Partitioning
and Clustering, Chap. 61 in [2.5]). For a graph with |V| nodes, the number of levels
can be estimated as

$$\left\lceil \log\left( |V| / v_0 \right) \right\rceil$$

where v_0 is the number of nodes in the most-clustered graph (level 0), and the base
of the logarithm is the clustering ratio. Clustering with this ratio is repeated until
the graph is small enough to be processed efficiently by the FM partitioning algorithm.
In practice, v_0 is typically 75-200, an instance size for which the FM algorithm
is nearly optimal. For netlists with more than 200 nodes, multilevel partitioning
techniques improve both runtime and solution quality. Academic
multilevel partitioners include hMetis [2.6], available as a binary, and MLPart
[2.1], available as open-source software.
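
The estimate can be evaluated with a few lines of Python. The function name and the sample netlist size of 100,000 nodes are illustrative assumptions; the clustering ratio 1.3 and v_0 = 200 are taken from the text.

```python
import math

def estimate_levels(num_nodes, v0, clustering_ratio=1.3):
    """Estimated number of coarsening levels: ceil(log_ratio(|V| / v0))."""
    return math.ceil(math.log(num_nodes / v0, clustering_ratio))

# Hypothetical netlist with 100,000 nodes, coarsened down to about 200 nodes
print(estimate_levels(100_000, 200))
```

Since the ratio is close to 1, even moderately large netlists pass through a few dozen levels, each only slightly smaller than the previous one.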

The most-clustered netlist is partitioned using the FM algorithm. Then, its cluster-to-partition
assignment is projected onto the larger, partially-unclustered netlist
one level below. This is done by assigning every subcluster to the partition to
which its parent cluster was assigned.

Using that partition as a starting configuration, subclusters in the partially-unclustered
graph can be moved by the FM algorithm from one partition to another
to improve cost while satisfying balance criteria (refinement). The process
continues until the bottom-level netlist is refined (Fig. 2.6).
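
The projection step amounts to a simple lookup, as the following sketch illustrates. The dictionary-based representation, the function name and the cluster names X, Y, Z are hypothetical.

```python
def project(parent_of, cluster_side):
    """Assign every subcluster to the partition of its parent cluster."""
    return {node: cluster_side[parent] for node, parent in parent_of.items()}

# Hypothetical hierarchy: five nodes merged into three clusters
parent_of = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Z"}
cluster_side = {"X": "A", "Y": "B", "Z": "B"}   # partition of the coarser level
print(project(parent_of, cluster_side))
```

The projected assignment has the same cut cost as the coarser solution, so refinement at the finer level starts from a solution that is at least as good.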




Fig. 2.6 An illustration of multilevel partitioning. (a) The original graph is first coarsened through
several levels. (b) The graph after coarsening. (c) After coarsening, a heuristic partition is found of
the most-coarsened graph. (d) That partition is then projected onto the next-coarsest graph (dotted
line) and then refined (solid line). Projection and refinement continue until a partitioning solution
for the original graph is found.
2.6 System Partitioning onto Multiple FPGAs



System implementation using field-programmable gate arrays (FPGAs), such as 
those manufactured by Xilinx or Altera, is an increasingly important application of 
partitioning. There are two main reasons for this trend. First, FPGA-based system 
prototyping allows products to meet shorter time-to-market windows, since embed- 
ded software development and design debugging can proceed concurrently with 
hardware design, rather than having to wait until the packaged dies arrive from the
foundry. Second, with increased non-recurring engineering costs (mask sets and 
probe cards) in advanced technology nodes, products with lower production vol- 
umes become economically feasible only when implemented using field- 
programmable devices. However, field-programmability (e.g., using SRAM-based 
lookup tables to implement reconfigurable logic and interconnect) comes at the cost 
of density, speed and power. Hence, even if a system easily fits onto a single ASIC, 
its prototype may require multiple FPGA devices. 

Functionally, FPGA-based systems may be viewed as logic (implemented using 
reprogrammable FPGAs) and interconnects (implemented using field-programmable 
interconnect chips, or FPICs). Many system components, including embedded proc- 
essor cores, embedded memories, and standard interfaces, are available as configur- 
able IPs on modern FPGA devices. Moreover, FPICs themselves can be imple- 
mented using FPGAs. An example FPGA-based system topology is illustrated in 
Fig. 2.7(a), where the FPGA and FPIC devices are connected using a Clos network 
topology, which allows any two devices to communicate directly (or within a small number
of hops). Fig. 2.7(b) demonstrates how a typical system architecture of logic and
memory can be mapped onto multiple FPGA devices. 




Fig. 2.7 (a) Reconfigurable system with multiple FPGA and FPIC devices. (b) Mapping of a typical
system architecture onto multiple FPGAs.



Key challenges for multi-way system partitioning onto FPGAs include (1) low utilization
of FPGA gate capacity because of hard I/O pin limits, (2) low clock speeds
due to long interconnect delays between multiple FPGAs, and (3) long runtimes for
the system partitioning process itself. This section discusses the associated algorithmic
challenges in physical design that are unique to system implementation with
multiple FPGAs.
Variant multi-way partitioning formulations. Multi-way partitioning for system
prototyping seeks to minimize the number of FPGA devices needed while taking 
into account both area constraints, i.e., the partitions must each fit into individual 
FPGAs, and I/O constraints, i.e., each FPGA has a fixed number of pins. In contrast 
to the single-chip context, a small change in balance or cut size can make a feasible 
solution infeasible. Thus, a challenge for partitioning algorithms is to achieve high 
utilization of the FPGA devices while meeting all I/O constraints. 

Once the number of FPGA devices has been determined, the secondary optimization
objective is to minimize the amount of communication between the connected devices.
Adapting general techniques for minimizing the net cut to FPGA-based architectures
can significantly improve the overall speed of the system. However, the
traditional net cut objective does not distinguish whether the gates of a five-pin net
are split across two, three, four or five FPGA devices. Since splitting a net
across k FPGA devices consumes k I/O pins, k should be minimized first.
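
The difference between the two objectives can be illustrated with a short Python sketch. The function names, the device labels and the two sample assignments are hypothetical; the pin count follows the statement above that a net split across k devices consumes k I/O pins.

```python
def net_cut(nets, device):
    """Number of nets whose cells are spread over more than one device."""
    return sum(1 for net in nets if len({device[c] for c in net}) > 1)

def io_pins(nets, device):
    """Total I/O pins used: a net spanning k > 1 devices consumes k pins."""
    total = 0
    for net in nets:
        k = len({device[c] for c in net})
        if k > 1:
            total += k
    return total

# One five-pin net under two hypothetical device assignments
nets = [("g1", "g2", "g3", "g4", "g5")]
two_way = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
five_way = {"g1": 0, "g2": 1, "g3": 2, "g4": 3, "g5": 4}
print(net_cut(nets, two_way), io_pins(nets, two_way))
print(net_cut(nets, five_way), io_pins(nets, five_way))
```

Both assignments have the same net cut, but the second one uses more than twice as many I/O pins.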

Variant placement formulations. The reprogrammable nature of FPGAs allows 
systems to be implemented as true reconfigurable computing machines, where de- 
vice configuration bits are updated to match the implemented logic to the required 
computation. This induces an extra "dimension" to the problems of logic partition- 
ing and placement - the solution must explicitly evolve through time, i.e., through 
the course of the computation. 

System implementation degrees of freedom. More performance optimizations are
available, and needed, at the system level than during place-and-route. System prototyping
may need to explore netlist transformations such as cloning (Sec. 8.5.3) and
retiming (Sec. 8.6) in order to minimize cut size (I/O usage) or system cycle time.
Such transformations are needed as inter-device delays can be relatively large and
because devices are often I/O-limited. L.-T. Liu et al. [2.9] proposed a partitioning
algorithm that permits logic replication to minimize both cut size and clock cycle of
sequential circuits. Given a netlist G = (V,E), their approach chooses two modules as
seeds s and t and constructs a "replication graph" that is twice the size of the original
circuit. This graph has the special property that a type of directed minimum cut
yields the replication cut, i.e., a decomposition of V into S, T and R, where s ∈ S, t ∈
T and R = V - S - T is the replicated logic, that is optimal. A directed version of the
Fiduccia-Mattheyses algorithm (Sec. 2.4.3) is used to find a heuristic directed minimum
cut in the replication graph.

"Flow-based" multi-way partitioning method. To decompose a system into mul- 
tiple devices, C.-W. Yeh et al. [2.10] proposed a "flow-based" algorithm inspired by 
the relationship between multi-commodity flow [2.2] and the traditional problem of 
min-cut partitioning. The algorithm constructs a flow network wherein each signal 
net initially corresponds to an edge with unit flow cost. To visualize this, one can 
imagine a network of roads, where each road corresponds to a signal net in the net- 
list, and where driving along each individual road requires a unit toll. Two random 
modules in the network are chosen, and the shortest (lowest-cost) path between them 
is computed. A constant γ < 1 is added to the flow for each net in the shortest path,
and the cost for every net in the path is incremented. Adjusting the cost penalizes
paths through congested areas and forces alternative shortest paths. This random
shortest path computation is repeated until every path between the chosen pair of 
modules passes through at least one "saturated" net (in the analogy, this would be a 
"congested road"). The set of saturated nets induces a multi-way partitioning in 
which two modules belong to the same cluster if and only if there is a path of un- 
saturated nets between them. A second phase of the algorithm makes the multi-way 
partitioning more balanced. Since this approach has efficient runtime and is easily 
parallelizable, it is well-suited for large-scale multi-way system partitioning. 

Commercial tools for partitioning large systems onto FPGAs typically base their 
algorithms on the multilevel extensions of the FM algorithm (Sec. 2.4.3). While it is 
not always possible to modify such algorithms to track relevant partitioning objec- 
tives directly, these algorithms often produce reasonable initial partitions when 
guided by the net cut objective and its variants. 
Chapter 2 Exercises



Exercise 1: KL Algorithm 

The graph to the right (nodes a-f) can be optimally partitioned using
the Kernighan-Lin algorithm. Perform the first pass of the algorithm.
The dotted line represents the initial partitioning. Assume all
nodes have the same weight and all edges have the same priority.

Note: Clearly describe each step of the algorithm. Also, show the 
resulting partitioning (after one pass) in graphical form. 




Exercise 2: Critical Nets and Gain During the FM Algorithm 

(a) For cells a-i, determine the critical nets connected to these cells and which criti- 
cal nets remain after partitioning. For the first iteration of the FM algorithm, deter- 
mine which cells would need to have their gains updated due to a move. Hint: It may 
be helpful to prepare a table with one row per move that records (1) the cell moved, 
(2) critical nets before the move, (3) critical nets after the move, and (4) which cells 
require a gain update. 




(b) Determine Δg(c) for each cell c ∈ V.

Exercise 3: FM Algorithm 

Perform Pass 2 of the FM algorithm example given in Sec. 2.4.3. Clearly describe 
each step. Show the result of each iteration in both numerical and graphical form. 

Exercise 4: System and Netlist Partitioning 

Explain key differences between partitioning formulations used for FPGA-based 
system emulation and traditional min-cut partitioning. 

Exercise 5: Multilevel FM Partitioning 

List and explain the advantages that a multilevel framework offers compared to the
FM algorithm alone. 



Exercise 6: Clustering 

Consider a partitioned netlist. Clustering algorithms covered in this chapter do not 
take a given partitioning into account. Explain how these algorithms can be modi- 
fied such that each new cluster is consistent with one of the initial partitions. 
3 Chip Planning



Chip planning deals with large modules such as caches, embedded memories, and 
intellectual property (IP) cores that have known areas, fixed or changeable shapes, 
and possibly fixed locations. When modules are not clearly specified, chip planning 
relies on netlist partitioning (Chap. 2) to identify such modules in large designs. 
Assigning shapes and locations to circuit modules during chip planning produces 
blocks, and enables early estimates of interconnect length, circuit delay and chip 
performance. Such early analysis can identify modules that need improvement. Chip
planning consists of three major stages: (1) floorplanning, (2) pin assignment, and (3)
power planning.

Recall from Chap. 2 that a gate-level or RTL netlist can be automatically partitioned 
into modules. Alternatively, such modules can be extracted from a hierarchical 
design representation. Large chip modules are laid out as blocks or rectangular 
shapes (Fig. 3.1). Floorplanning determines the locations and dimensions of these 
shapes, based on the areas and aspect ratios of the modules so as to optimize chip 
size, reduce interconnect and improve timing. Pin assignment connects outgoing
signal nets to block pins. I/O placement finds the locations for the chip's input and 
output pads, often at the periphery of the chip. This step is (ideally) performed 
before floorplanning, but locations can be updated during and after floorplanning. 
Power planning builds the power supply network, i.e., power and ground nets, so as 
to ensure that each block is provided with appropriate supply voltage. The results of 
partitioning and chip planning greatly affect subsequent design steps. 




Fig. 3.1 Chip planning and relevant terminology. A module is a cluster of logic with a known area.
Once it has been assigned shape or dimensions, it becomes a block. Connections between blocks are
implemented through internal pins but are not shown in the figure.



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6_3, © Springer Science+Business Media B.V. 2011 
3.1 Introduction to Floorplanning



Before the floorplanning stage, the design is split into individual circuit modules. A 
module becomes a rectangular block after it is assigned dimensions or a shape.
These blocks can be either hard or soft. The dimensions and areas of hard blocks are 
fixed. For a soft block, the area is fixed but the aspect ratio can be changed, either 
continuously or in discrete steps. The entire arrangement of blocks, including their 
positions, is called a floorplan. In large designs, individual modules may also be 
floorplanned in a recursive top-down fashion, but it is common to focus on one
hierarchical level of floorplanning at a time. In this case, the floorplan of the highest 
level is called the top-level floorplan. 

The floorplanning stage ensures that (1) every chip module is assigned a shape and 
a location, so as to facilitate gate placement, and (2) every pin that has an external 
connection is assigned a location, so that internal and external nets can be routed. 

The floorplanning stage determines the external characteristics - fixed dimensions 
and external pin locations - of each module. These characteristics are necessary for 
the subsequent placement (Chap. 4) and routing (Chaps. 5-7) steps, which determine
the internal characteristics of the blocks. Floorplan optimization involves multiple 
degrees of freedom; while it includes some aspects of placement (finding locations) 
and connection routing (pin assignment), module shape optimization is unique to 
floorplanning. Floorplanning with hard blocks is particularly relevant when reusing 
pre-existing blocks, including intellectual property (IP). Mathematically, this
problem can be viewed as a constrained case of floorplanning with soft parameters, 
but in practice, it may require specialized computational techniques. 

A floorplanning instance commonly includes the following parameters - (1) the area 
of each module, (2) all potential aspect ratios of each module, and (3) the netlist of 
all (external) connections incident to the module. 

Example: Floorplan Area Minimization 

Given: three modules a-c with the following potential widths and heights. 



a:   w_a = 1, h_a = 4   or   w_a = 4, h_a = 1   or   w_a = 2, h_a = 2
b:   w_b = 1, h_b = 2   or   w_b = 2, h_b = 1
c:   w_c = 1, h_c = 3   or   w_c = 3, h_c = 1











Task: find a floorplan with minimum total area enclosed by its global bounding box 
(definition in Sec. 3.2). 

Solution: 

a: w_a = 2, h_a = 2    b: w_b = 2, h_b = 1    c: w_c = 1, h_c = 3

This floorplan has a global bounding box with minimum possible area 

(9 square units). 

A possible arrangement of the blocks (right). 
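
For an instance this small, the minimum bounding-box area can be confirmed by exhaustive search. The following Python sketch is illustrative only: it assumes integer block coordinates, limits the search to bounding boxes with sides of at most max_side, and all names are hypothetical. Several arrangements reach the minimum, so the arrangement it reports may differ from the one chosen above.

```python
from itertools import product

SHAPES = {                      # allowed (width, height) pairs per module
    "a": [(1, 4), (4, 1), (2, 2)],
    "b": [(1, 2), (2, 1)],
    "c": [(1, 3), (3, 1)],
}

def overlaps(r1, r2):
    """True if two rectangles (x, y, w, h) share interior area."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    return x1 < x2 + w2 and x2 < x1 + w1 and y1 < y2 + h2 and y2 < y1 + h1

def fits(box_w, box_h, shapes):
    """Return an overlap-free placement inside the box, or None."""
    names = sorted(shapes)
    for choice in product(*(shapes[n] for n in names)):
        slots = [[(x, y, w, h)
                  for x in range(box_w - w + 1)
                  for y in range(box_h - h + 1)]
                 for (w, h) in choice]
        for placement in product(*slots):
            if all(not overlaps(p, q)
                   for i, p in enumerate(placement)
                   for q in placement[i + 1:]):
                return dict(zip(names, placement))
    return None

def min_bounding_box(shapes, max_side=6):
    """Try bounding boxes in order of increasing area; return the first that fits."""
    boxes = sorted((w * h, w, h)
                   for w in range(1, max_side + 1)
                   for h in range(1, max_side + 1))
    for box_area, w, h in boxes:
        placement = fits(w, h, shapes)
        if placement:
            return box_area, (w, h), placement
    return None

best_area, best_box, best_placement = min_bounding_box(SHAPES)
print(best_area, best_box, best_placement)
```

The result cannot be improved, because the module areas already sum to 4 + 2 + 3 = 9 square units.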
3.2 Optimization Goals in Floorplanning



Floorplan design optimizes both the locations and the aspect ratios of the individual 
blocks, using simple objective functions to capture practically desirable floorplan 
attributes. This section introduces several objective functions for floorplanning. 
Goals for pin assignment are discussed in Sec. 3.6, and goals for power planning are 
described in Sec. 3.7. 

Area and shape of the global bounding box. The global bounding box of a 
floorplan is the minimum axis-aligned (isothetic) rectangle that contains all 
floorplan blocks. The area of the global bounding box represents the area of the 
top-level floorplan (the full design) and directly impacts circuit performance, yield, 
and manufacturing cost. Minimizing the area of the global bounding box involves 
finding (x,y) locations, as well as shapes, of the individual modules such that they
pack densely together. 

Beyond area minimization, another optimization objective is to keep the aspect ratio 
of the global bounding box as close as possible to a given target value. For instance, 
due to manufacturing and package size considerations, a square chip (aspect ratio ≈
1) may be preferable to a non-square chip. To this end, the shape flexibility of the 
individual modules can be exploited. Area and aspect ratio of the global bounding 
box are interrelated, and these two objectives are often considered together. 

Total wirelength. Long connections between floorplan blocks may increase signal 
propagation delays in the design. Therefore, layout of high-performance circuits 
seeks to shorten such interconnects. Switching the logic value carried by a particular 
net requires energy dissipation that grows with wire capacitance. Therefore, power 
minimization may also seek to shorten all routes. A third context for wirelength 
minimization involves routability and manufacturing cost. When the total length of 
all connections is too high or when the connections are overly dense in a particular 
region, there may not be enough routing resources to complete all connections. 
Although circuit blocks may be spread further apart to add new routing tracks, this 
increases chip size and manufacturing cost, and may further increase net length. 

To simplify computation of the total wirelength of the floorplan, one option is to 
connect all nets to the centers of the blocks. Although this technique does not yield a 
precise wirelength estimate, it is relatively accurate for medium-sized and small 
blocks, and enables rapid interconnect evaluation [3.17]. Two common approaches 
to model connectivity within a floorplan are to use (1) a connection matrix C (Sec. 
1.8) representing the union of all nets, along with pairwise distances between blocks, 
or (2) a minimum spanning tree for each net (Sec. 5.6). Using the first model, the 
total connection length L(F) of the floorplan F is estimated as 

$$L(F) = \sum_{i,j \in F} C[i][j] \cdot d_M(i,j)$$

where element C[i][j] of C is the degree of connectivity between blocks i and j, and 
d_M(i,j) is the Manhattan distance between the center points of i and j (Sec. 1.8). 

Using the second model, the total connection length L(F) is estimated as 

$$L(F) = \sum_{net \in F} L_{MST}(net)$$

where L_MST(net) is the minimum spanning tree cost of net net. 
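
The two estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block centers, the netlist and the connectivity values are hypothetical, the function names are not from the text, and the first model is summed over unordered block pairs so that a symmetric connection matrix is not counted twice.

```python
from itertools import combinations

def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def wirelength_matrix(centers, conn):
    """Model 1: sum of C[i][j] * d_M(i, j) over unordered block pairs."""
    return sum(conn.get(frozenset(pair), 0) * manhattan(centers[pair[0]], centers[pair[1]])
               for pair in combinations(sorted(centers), 2))

def mst_cost(points):
    """Prim's algorithm on the complete Manhattan-distance graph of one net."""
    points = list(points)
    in_tree, cost = {0}, 0
    while len(in_tree) < len(points):
        dist, nxt = min((manhattan(points[i], points[k]), k)
                        for i in in_tree
                        for k in range(len(points)) if k not in in_tree)
        cost += dist
        in_tree.add(nxt)
    return cost

def wirelength_mst(centers, nets):
    """Model 2: sum of per-net minimum spanning tree costs."""
    return sum(mst_cost(centers[b] for b in net) for net in nets)

# Hypothetical data: three block centers, two nets, and the matching
# connectivity (each net contributes 1 to every block pair it touches).
centers = {"a": (0, 0), "b": (4, 0), "c": (4, 3)}
nets = [("a", "b"), ("a", "b", "c")]
conn = {frozenset("ab"): 2, frozenset("ac"): 1, frozenset("bc"): 1}
print(wirelength_matrix(centers, conn))   # 2*4 + 1*7 + 1*3 = 18
print(wirelength_mst(centers, nets))      # 4 + (4 + 3) = 11
```

The matrix model counts every connected pair, so multi-pin nets weigh more heavily than in the spanning-tree model.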

In practice, more sophisticated wirelength objectives are often used. The center-pin 
location assumption may be improved by using actual pin locations [3.17]. The 
Manhattan distance wiring cost approximation may be improved by using pin-to-pin 
shortest paths in a graph representation of available routing resources. This can 
reflect not only distance, but routing congestion, signal delay, obstacles, and routing 
channels as well. With these refinements, wiring estimation in a floorplan relies on 
the construction of heuristic Steiner minimum trees in a weighted graph (Chap. 5). 

Combination of area and total wirelength. To reduce both the total area area(F) 
and the total wirelength L(F) of floorplan F, it is common to minimize 

$$\alpha \cdot area(F) + (1 - \alpha) \cdot L(F)$$

where the parameter 0 ≤ α ≤ 1 gives the relative importance between area(F) and 
L(F). Other terms, such as the aspect ratio of the floorplan, can be added to this 
objective function [3.3]. In practice, the area of the global bounding box may be a 
constraint rather than an optimization objective. This is appropriate when the 
package size and its cavity dimensions are fixed, or when the global bounding box is 
part of a higher-level system organization across multiple chips. In this case, 
wirelength and other objectives are optimized subject to the constraint that the 
floorplan fits inside a prescribed global bounding box (the fixed-outline 
floorplanning problem). 

Signal delays. Until the 1990s, transistors that made up logic gates were the greatest 
contributor to chip delay. Since then, due to different delay scaling rates, 
interconnect delays have gradually become more important, and increasingly 
determine the chip's achievable clock frequency. Delays of long wires are 
particularly sensitive to the locations and shapes of floorplan blocks. A desirable 
quality of a floorplan is short wires connecting its blocks, such that all timing 
requirements are met. Often, critical paths and nets are given priority during 
floorplanning so that they span short distances. 

Floorplan optimization techniques have been developed that use static timing 
analysis (Sec. 8.2.1) to identify the interconnects that lie on critical paths. If timing 
is violated, i.e., path delays exceed given constraints, the floorplan is modified to 
shorten critical interconnects and meet timing constraints [3.7]. 
3.3 



A rectangular dissection is a division of the chip area into a set of blocks or 
non-overlapping rectangles. 

A slicing floorplan is a rectangular dissection obtained by repeatedly dividing each 
rectangle, starting with the entire chip area, into two smaller rectangles using a 
horizontal or vertical cut line. 

A slicing tree or slicing floorplan tree is a binary tree with k leaves and k-1 internal 
nodes, where each leaf represents a block and each internal node represents a 
horizontal or vertical cut line (Fig. 3.2). This book uses a standard notation, denoting 
horizontal and vertical cuts by H and V, respectively. A key characteristic of the 
slicing tree is that each internal node has exactly two children. Therefore, every 
slicing floorplan can be represented by at least one slicing tree. 
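
A slicing tree can be evaluated bottom-up to obtain the bounding box of the floorplan it describes. The sketch below assumes fixed block dimensions and a hypothetical tuple encoding `(cut, left, right)` for internal nodes; neither the encoding nor the function name comes from the text.

```python
def evaluate(node, dims):
    """Return (width, height) of the bounding box of a slicing (sub)tree.

    A leaf is a block name; an internal node is a tuple (cut, left, right)
    with cut == "H" (horizontal cut line) or "V" (vertical cut line).
    """
    if isinstance(node, str):
        return dims[node]
    cut, left, right = node
    wl, hl = evaluate(left, dims)
    wr, hr = evaluate(right, dims)
    if cut == "V":   # vertical cut line: children sit side by side
        return (wl + wr, max(hl, hr))
    if cut == "H":   # horizontal cut line: children are stacked
        return (max(wl, wr), hl + hr)
    raise ValueError(f"unknown cut type: {cut}")

dims = {"a": (3, 1), "b": (4, 1), "c": (2, 2)}   # hypothetical (width, height) pairs
print(evaluate(("V", "a", "b"), dims))               # (7, 1)
print(evaluate(("H", ("V", "a", "b"), "c"), dims))   # (7, 3)
```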
Fig. 3.2 A slicing floorplan of blocks a-f and two possible corresponding slicing trees. 

A non-slicing floorplan is a floorplan that cannot be formed by a sequence of only 
vertical or horizontal cuts in a parent block. The smallest example of a non-slicing 
floorplan without wasted space is the wheel (Fig. 3.3). 
Fig. 3.3 Two different minimal non-slicing 
floorplans, also known as wheels. 



A floorplan tree is a tree that represents a hierarchical floorplan (Fig. 3.4). Each leaf 
node represents a block while each internal node represents either a horizontal cut 
(H), a vertical cut (V), or a wheel (W). The order of the floorplan tree is the number 
of its internal (non-leaf) nodes. 







Fig. 3.4 A hierarchical floorplan (left) and its corresponding floorplan tree of order five (right). The 
internal node W represents a wheel with blocks c-g. 

A constraint-graph pair is a floorplan representation that consists of two directed 
graphs - vertical constraint graph and horizontal constraint graph - which capture 
the relations between block positions. A constraint graph consists of edges 
connecting n + 2 weighted nodes - one source node s, one sink node t, and n block 
nodes v1, v2, ..., vn representing n blocks m1, m2, ..., mn. The weight w(vi) of a block 
node vi, 1 ≤ i ≤ n, represents the size (relevant dimension) of the corresponding 
block; the weights of the source node w(s) and sink node w(t) are zero. Fig. 3.5 
illustrates a sample floorplan and its corresponding constraint graphs. The process of 
deriving a constraint-graph pair from a floorplan is discussed in Sec. 3.4.1. 

In a vertical constraint graph (VCG), node weights represent the heights of the 
corresponding blocks. Two nodes vi and vj, with corresponding blocks mi and mj, are 
connected with a directed edge from vi to vj if mi is below mj. 

In a horizontal constraint graph (HCG), node weights represent the widths of the 
corresponding blocks. Two nodes vi and vj, with corresponding blocks mi and mj, are 
connected with a directed edge from vi to vj if mi is to the left of mj. 

The longest path in the VCG corresponds to the minimum vertical extent required to 
pack the blocks (floorplan height). Similarly, the longest path in the HCG 
corresponds to the minimum horizontal extent required (floorplan width). In Fig. 3.5, 
a longest path is shown in both the HCG and VCG. 
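
The longest-path computation can be sketched as follows. The source and sink nodes are left out because their weights are zero; the edge lists and block dimensions are those of the five-block example used in Secs. 3.4.1 and 3.4.3, and the function name is an assumption made for illustration.

```python
from functools import lru_cache

def longest_path(weights, edges):
    """Length of the longest node-weighted path in a DAG.

    weights: node -> weight (block width for the HCG, height for the VCG)
    edges:   iterable of (u, v) meaning u is left of / below v
    """
    preds = {v: [] for v in weights}
    for u, v in edges:
        preds[v].append(u)

    @lru_cache(maxsize=None)
    def span(v):
        # Far edge of block v once everything before it has been packed.
        return weights[v] + max((span(u) for u in preds[v]), default=0)

    return max(span(v) for v in weights)

widths = {"a": 8, "b": 4, "c": 4, "d": 4, "e": 4}
heights = {"a": 4, "b": 3, "c": 5, "d": 5, "e": 6}
hcg = [("a", "b"), ("a", "e"), ("c", "d"), ("d", "e")]
vcg = [("c", "a"), ("d", "a"), ("e", "b")]
print(longest_path(widths, hcg))    # floorplan width:  12
print(longest_path(heights, vcg))   # floorplan height: 9
```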

A sequence pair is an ordered pair (S+,S-) of block permutations. Together, the two 
permutations represent geometric relations between every pair of blocks a and b. 
Specifically, if a appears before b in both S+ and S-, then a is to the left of b. 
Otherwise, if a appears before b in S+ but not in S-, then a is above b. 

S+: <...a...b...>   S-: <...a...b...>   if block a is left of block b 
S+: <...a...b...>   S-: <...b...a...>   if block a is above block b 
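
The two rules can be turned into a small lookup. In this sketch the function name is hypothetical, and the "below" and "right of" outcomes are simply the two rules read with a and b exchanged.

```python
def relation(s_plus, s_minus, a, b):
    """Geometric relation of block a to block b implied by a sequence pair."""
    before_plus = s_plus.index(a) < s_plus.index(b)
    before_minus = s_minus.index(a) < s_minus.index(b)
    if before_plus and before_minus:
        return "left of"
    if before_plus and not before_minus:
        return "above"
    if not before_plus and before_minus:
        return "below"      # b is above a
    return "right of"       # b is left of a

# Sequence pair of the five-block example in Sec. 3.4.2.
s_plus, s_minus = "acdbe", "cdaeb"
print(relation(s_plus, s_minus, "a", "b"))   # left of
print(relation(s_plus, s_minus, "a", "c"))   # above
```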



Fig. 3.5 illustrates a sample floorplan and its corresponding sequence pair. The 
process to derive a sequence pair from a floorplan is discussed in Sec. 3.4.2. 
Sequence Pair 
S+: <badcgefhi> 
S-: <ihabcfgde> 



Fig. 3.5 A floorplan of blocks a-i (top left), its vertical constraint graph (top right), its horizontal 
constraint graph (bottom left), and its sequence pair (bottom right). 



3.4 Floorplan Representations 






This section discusses how to convert (1) a floorplan into a constraint-graph pair, (2) 
a floorplan into a sequence pair, and (3) a sequence pair into a floorplan. Note that 
modern sequence pair algorithms do not compute constraint graphs explicitly. 
However, the constraint graphs offer a useful, intermediate representation between 
floorplans and sequence pairs for a better conceptual understanding. 

3.4.1 Floorplan to a Constraint-Graph Pair 



A floorplan can be converted to a constraint-graph pair in three steps. First, for each 
constraint graph, create a block node vi for each of the n blocks mi, 1 ≤ i ≤ n, a 
source node s and a sink node t. Second, for the vertical (horizontal) constraint graph, 
add a directed edge (vi,vj) if mi is below (left of) mj. Third, for each constraint graph, 
remove all edges that can be derived from other edges by transitivity. 
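
A minimal sketch of these steps is given below. It makes several assumptions: blocks are given as (x, y, width, height) with hypothetical coordinates matching the five-block example of Sec. 3.4.3, a block counts as left of (below) another only if their vertical (horizontal) extents overlap, and the source and sink nodes are omitted for brevity. The function names are illustrative only.

```python
def transitive_reduction(edges):
    """Drop every edge (u, v) of a DAG that is implied by a longer path."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in succ.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    return {(u, v) for u, v in edges
            if not any(v in reachable(w) for w in succ[u] if w != v)}

def constraint_graphs(blocks):
    """blocks: name -> (x, y, w, h). Returns (HCG edges, VCG edges)."""
    def overlap(lo1, hi1, lo2, hi2):
        return lo1 < hi2 and lo2 < hi1

    hcg, vcg = set(), set()
    for i, (xi, yi, wi, hi) in blocks.items():
        for j, (xj, yj, wj, hj) in blocks.items():
            if i == j:
                continue
            if xi + wi <= xj and overlap(yi, yi + hi, yj, yj + hj):
                hcg.add((i, j))      # block i is left of block j
            if yi + hi <= yj and overlap(xi, xi + wi, xj, xj + wj):
                vcg.add((i, j))      # block i is below block j
    return transitive_reduction(hcg), transitive_reduction(vcg)

blocks = {"a": (0, 5, 8, 4), "b": (8, 6, 4, 3), "c": (0, 0, 4, 5),
          "d": (4, 0, 4, 5), "e": (8, 0, 4, 6)}
hcg, vcg = constraint_graphs(blocks)
print(sorted(hcg))   # [('a', 'b'), ('a', 'e'), ('c', 'd'), ('d', 'e')]
print(sorted(vcg))   # [('c', 'a'), ('d', 'a'), ('e', 'b')]
```

In this run the only edge removed by the third step is (c,e) in the HCG, since it follows from (c,d) and (d,e).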



Example: Floorplan to a Constraint-Graph Pair 
Given: a floorplan with blocks a-e (right). 

Task: generate the corresponding horizontal constraint graph (HCG) 
and vertical constraint graph (VCG). 
Solution: 

VCG and HCG: Create nodes a-e for blocks a-e. Create the source node s and sink node t. 

VCG: Add a directed edge (vi,vj) if mi is below mj. 

HCG: Add a directed edge (vi,vj) if mi is left of mj. 

VCG and HCG: Remove all transitive (redundant) edges. 



3.4.2 Floorplan to a Sequence Pair 



A sequence pair (S+,S-) encodes the same relations as the constraint-graph pair. 
Given two blocks a and b, in the sequence pair, if a comes before b in S+, i.e., 
<...a...b...>, and a comes before b in S-, i.e., <...a...b...>, then a is to the left of b. 
If a comes before b in S+, i.e., <...a...b...>, and a comes after b in S-, i.e., 
<...b...a...>, then a is above b. Conceptually, the constraint graphs are generated 
first, and then the block ordering rules are applied. However, the constraint graphs 
do not need to be created explicitly. Every pair of non-overlapping blocks is ordered 
either horizontally or vertically, based on the block locations. These ordering 
relations are encoded as constraints. When either constraint can be chosen, ties are 
broken, e.g., to always establish a horizontal constraint. Consider blocks a and b 
with locations (xa,ya) and (xb,yb) and dimensions (wa,ha) and (wb,hb), respectively. 

if ((xa + wa ≤ xb) and not (ya + ha ≤ yb or yb + hb ≤ ya)), then a is left of b 
if ((yb + hb ≤ ya) and not (xa + wa ≤ xb or xb + wb ≤ xa)), then a is above b 
Example: Floorplan to a Sequence Pair 

Given: a floorplan with blocks a-e (right, same as in Sec. 3.4.1). 

Task: generate the corresponding sequence pair. 

Solution: 

Generate the VCG and HCG (example in Sec. 3.4.1). Evaluate each graph independently. 

VCG: Blocks c and d are below block a. 
Sequence pair so far: S+: <...a...c...d...> S-: <...c...d...a...> 

VCG: Block e is below block b. 
Sequence pair so far: S+: <acdbe> S-: <cdaeb> 

HCG: Block a is left of blocks b and e. 
Sequence pair so far: S+: <acdbe> S-: <cdaeb> 

HCG: Block c is left of block d. 
Sequence pair so far: S+: <acdbe> S-: <cdaeb> 

HCG: Block d is left of block e. 
Sequence pair: S+: <acdbe> S-: <cdaeb> 

3.4.3 Sequence Pair to a Floorplan 

Given a sequence pair, the following additional information is needed to generate a 
floorplan: (1) the origin of the floorplan, (2) the width and height of each block, and 
(3) the packing direction of the floorplan. The origin denotes the starting location at 
which the first block should be placed. The block dimensions are needed to calculate 
the relative displacements between blocks. The packing direction facilitates area 
minimization. Given different floorplan packing strategies, e.g., packing as left and 
downward as possible, or as right and upward as possible, different floorplans can 
be generated. For this section, assume that the packing direction is left and down. 

Evaluating a sequence pair to generate a floorplan is closely related to finding the 
weighted longest common subsequence (LCS) of the two sequences [3.13]. That is, 
finding the x-coordinates of each block is equivalent to computing LCS(S+,S-), and 
finding the y-coordinates of each block is equivalent to computing LCS(S+^R,S-), 
where S+^R is the reverse of S+. 



Sequence Pair Evaluation Algorithm 

Input: sequence pair (S+,S-), widths (heights) of n blocks widths[n] (heights[n]) 
Output: x- (y-) coordinates x_coords (y_coords), dimensions of floorplan W × H 

1.  for (i = 1 to n)                            // weights vector as block widths 
2.    weights[i] = widths[i] 
3.  (x_coords,W) = LCS(S+,S-,weights)           // x-coordinates, total width W 
4.  for (i = 1 to n)                            // weights vector as block heights 
5.    weights[i] = heights[i] 
6.    S+^R[i] = S+[n + 1 - i]                   // reverses S+ 
7.  (y_coords,H) = LCS(S+^R,S-,weights)         // y-coordinates, total height H 
In the sequence pair evaluation algorithm, lines 1-2 initialize weights as the block 
widths for n blocks. Line 3 computes the x-coordinates of each block of the 
floorplan by finding the weighted longest common subsequence (LCS algorithm) of 
S+ and S-, given weights. Lines 4-5 re-initialize weights as the block heights. Line 6 
reverses S+ to obtain S+^R. Line 7 computes the y-coordinates of each block of the 
floorplan by finding the weighted LCS of S+^R and S-, given weights. 



Longest Common Subsequence (LCS) Algorithm 

Input: sequences S1 and S2, weights of n elements weights[n] 
Output: positions of each block positions, total span L 

1.  for (i = 1 to n) 
2.    block_order[S2[i]] = i                        // index in S2 of each block 
3.    lengths[i] = 0                                // initialize total span of all blocks 
4.  for (i = 1 to n) 
5.    block = S1[i]                                 // current block 
6.    index = block_order[block]                    // index in S2 of current block 
7.    positions[block] = lengths[index]             // compute position of block 
8.    t_span = positions[block] + weights[block]    // find span of current block 
9.    for (j = index to n) 
10.     if (t_span > lengths[j])                    // current block span > previous 
11.       lengths[j] = t_span 
12.     else break 
13. L = lengths[n]                                  // total span 



The procedure LCS(S1,S2,weights) computes the weighted LCS of two sequences S1 
and S2 with weights weights. The vector block_order records the index in S2 of each 
block (lines 1-2). Line 3 initializes lengths, the vector that stores the (maximum) 
span (representing the height or width) of each block, to 0. For each position in S1 
(line 4), block is the current block (line 5) and index is the index in S2 of this 
current block (line 6). The position of block is then set to the first position not 
occupied by other blocks (line 7). That is, all blocks to the left of block are packed 
into the interval up to lengths[index], and block is positioned immediately afterward. 
Lines 9-12 update lengths to reflect the new total span of the floorplan - e.g., the last 
element lengths[n] stores the total span after the locations of all n blocks have been 
determined (line 13). Lines 10-11 update lengths[j] with (position of block + weight 
of block) if this exceeds the current span. 

To find the x-coordinates of the floorplan, the LCS algorithm is called with S1 = S+, 
S2 = S-, and weights = widths. To find the y-coordinates, the LCS algorithm is called 
with S1 = S+^R, S2 = S-, and weights = heights. 
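
The two procedures translate almost line for line into Python. The sketch below uses 0-based indices and dictionaries keyed by block name, which are implementation choices rather than part of the pseudocode; the data are the sequence pair and block dimensions of the five-block example in this section.

```python
def weighted_lcs(s1, s2, weights):
    """Weighted LCS of s1 and s2: returns (positions, total span)."""
    n = len(s2)
    block_order = {block: i for i, block in enumerate(s2)}   # index in s2 of each block
    lengths = [0] * n                                        # running spans, indexed like s2
    positions = {}
    for block in s1:
        index = block_order[block]
        positions[block] = lengths[index]                    # first free position
        t_span = positions[block] + weights[block]
        for j in range(index, n):                            # propagate the new span
            if t_span > lengths[j]:
                lengths[j] = t_span
            else:
                break
    return positions, lengths[-1]

def evaluate_sequence_pair(s_plus, s_minus, widths, heights):
    """Pack left and down; returns (x_coords, y_coords, W, H)."""
    x_coords, W = weighted_lcs(s_plus, s_minus, widths)
    y_coords, H = weighted_lcs(s_plus[::-1], s_minus, heights)   # reversed S+
    return x_coords, y_coords, W, H

widths = {"a": 8, "b": 4, "c": 4, "d": 4, "e": 4}
heights = {"a": 4, "b": 3, "c": 5, "d": 5, "e": 6}
x, y, W, H = evaluate_sequence_pair("acdbe", "cdaeb", widths, heights)
print(W, H)                                  # 12 9
print({b: (x[b], y[b]) for b in "abcde"})    # a (0,5), b (8,6), c (0,0), d (4,0), e (8,0)
```

The early `break` relies on `lengths` being non-decreasing, which the update loop preserves.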



Example: Sequence Pair to a Floorplan 

Given: (1) sequence pair S+: <acdbe> S-: <cdaeb>, (2) packing direction: left and down, (3) 
floorplan origin (0,0), and (4) blocks a-e with their dimensions. 

a: (wa,ha) = (8,4)    b: (wb,hb) = (4,3) 
c: (wc,hc) = (4,5)    d: (wd,hd) = (4,5) 
e: (we,he) = (4,6) 
Task: generate the corresponding floorplan. 

Solution: 

widths[a b c d e] = [8 4 4 4 4]    heights[a b c d e] = [4 3 5 5 6] 

Find x-coordinates. 

S1 = S+ = <acdbe>, S2 = S- = <cdaeb> 
weights[a b c d e] = widths[a b c d e] = [8 4 4 4 4] 
block_order[a b c d e] = [3 5 1 2 4] 
lengths = [0 0 0 0 0] 

Iteration i = 1: block = a 
index = block_order[a] = 3 
positions[a] = lengths[index] = lengths[3] = 0 
t_span = positions[a] + weights[a] = 0 + 8 = 8 
Update lengths vector from index = 3 to n = 5: lengths = [0 0 8 8 8] 

Iteration i = 2: block = c 
index = block_order[c] = 1 
positions[c] = lengths[index] = lengths[1] = 0 
t_span = positions[c] + weights[c] = 0 + 4 = 4 
Update lengths vector from index = 1 to n = 5: lengths = [4 4 8 8 8] 

Iteration i = 3: block = d 
index = block_order[d] = 2 
positions[d] = lengths[index] = lengths[2] = 4 
t_span = positions[d] + weights[d] = 4 + 4 = 8 
Update lengths vector from index = 2 to n = 5: lengths = [4 8 8 8 8] 

Iteration i = 4: block = b 
index = block_order[b] = 5 
positions[b] = lengths[index] = lengths[5] = 8 
t_span = positions[b] + weights[b] = 8 + 4 = 12 
Update lengths vector from index = 5 to n = 5: lengths = [4 8 8 8 12] 

Iteration i = 5: block = e 
index = block_order[e] = 4 
positions[e] = lengths[index] = lengths[4] = 8 
t_span = positions[e] + weights[e] = 8 + 4 = 12 
Update lengths vector from index = 4 to n = 5: lengths = [4 8 8 12 12] 

x-coordinates: positions[a b c d e] = [0 8 0 4 8], width of floorplan W = lengths[5] = 12. 

Find y-coordinates. 

S1 = S+^R = <ebdca>, S2 = S- = <cdaeb> 
weights[a b c d e] = heights[a b c d e] = [4 3 5 5 6] 
block_order[a b c d e] = [3 5 1 2 4] 
lengths = [0 0 0 0 0] 

Iteration i = 1: block = e 
index = block_order[e] = 4 
positions[e] = lengths[index] = lengths[4] = 0 
t_span = positions[e] + weights[e] = 0 + 6 = 6 
Update lengths vector from index = 4 to n = 5: lengths = [0 0 0 6 6] 

Iteration i = 2: block = b 
index = block_order[b] = 5 
positions[b] = lengths[index] = lengths[5] = 6 
t_span = positions[b] + weights[b] = 6 + 3 = 9 
Update lengths vector from index = 5 to n = 5: lengths = [0 0 0 6 9] 

Iteration i = 3: block = d 
index = block_order[d] = 2 
positions[d] = lengths[index] = lengths[2] = 0 
t_span = positions[d] + weights[d] = 0 + 5 = 5 
Update lengths vector from index = 2 to n = 5: lengths = [0 5 5 6 9] 

Iteration i = 4: block = c 
index = block_order[c] = 1 
positions[c] = lengths[index] = lengths[1] = 0 
t_span = positions[c] + weights[c] = 0 + 5 = 5 
Update lengths vector from index = 1 to n = 5: lengths = [5 5 5 6 9] 

Iteration i = 5: block = a 
index = block_order[a] = 3 
positions[a] = lengths[index] = lengths[3] = 5 
t_span = positions[a] + weights[a] = 5 + 4 = 9 
Update lengths vector from index = 3 to n = 5: lengths = [5 5 9 9 9] 

y-coordinates: positions[a b c d e] = [5 6 0 0 0], height of floorplan H = lengths[5] = 9. 

Floorplan size: W = 12 × H = 9 

Coordinates of blocks a-e: 
a (0,5)   b (8,6)   c (0,0)   d (4,0)   e (8,0) 



3.5 Floorplanning Algorithms 



This section presents several algorithms used in floorplan optimization. Given a set 
of blocks, floorplan sizing determines the minimum area of the floorplan as well as 
the associated orientations and dimensions of each individual block. Techniques 
such as cluster growth and simulated annealing comprehend the netlist of 
interconnects between blocks and seek to (1) minimize the total length of 
interconnect, subject to an upper bound on the floorplan area, or (2) simultaneously 
optimize wirelength and area. 
3.5.1 Floorplan Sizing 



Floorplan sizing finds the dimensions of the minimum-area floorplan and 
corresponding dimensions of the individual blocks. The earliest algorithms, due to 
Otten [3.8] and Stockmeyer [3.11], make use of flexible dimensions for the 
individual blocks to find minimal top-level floorplan areas and shapes. A minimal 
top-level floorplan is selected, and the individual block shapes are chosen 
accordingly. Since the algorithm uses the shapes of both the individual blocks and 
the top-level floorplan, shape functions and corner points (limits) play a major role 
in determining an optimal floorplan. 

Shape functions (shape curves) and corner points. Consider a block block with 
area area(block). By definition, if a block has width w_block and height h_block, then 

$$w_{block} \cdot h_{block} \geq area(block)$$

This relation can be rewritten in terms of width as a shape function (Fig. 3.6(a)): 

$$h_{block}(w) = \frac{area(block)}{w}$$

That is, the shape function h_block(w) states that for a given block width w, any block 
height h ≥ h_block(w) is legal. Shape functions can also include lower bounds on the 
block's width LB(w_block) and height LB(h_block). With such lower bounds, some (h,w) 
pairs are excluded (Fig. 3.6(b)). 
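
As a small illustration, the legality test implied by the shape function and its lower bounds might look as follows. The function and parameter names are hypothetical, and the numeric values are made up for the example.

```python
def shape_function(area, w):
    """Minimum legal height h_block(w) for a block of the given area."""
    return area / w

def is_legal(w, h, area, lb_w=0.0, lb_h=0.0):
    """True if the (w, h) pair lies in the feasible region of the shape function."""
    return w > 0 and w >= lb_w and h >= lb_h and h >= shape_function(area, w)

print(shape_function(12, 4))           # 3.0
print(is_legal(4, 3, 12))              # True: exactly on the shape curve
print(is_legal(4, 2, 12))              # False: too small for area 12
print(is_legal(1, 12, 12, lb_w=2))     # False: violates the width lower bound
```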







Fig. 3.6 Shape functions of blocks. The gray regions contain all (h,w) pairs that form a valid block 
[3.3]. (a) Shape function with no restrictions. (b) Shape function with minimum width LB(w_block) 
and height LB(h_block) restrictions for a block block. (c) Shape function with discrete (h,w) values. (d) 
Shape function of a possible hard library block, where its orientation can be more restricted. 

Due to technology-dependent design rules, {h,w) pairs can be restricted to discrete 
values (Fig. 3.6(c)). Certain block libraries can impose even stronger restrictions. 
For instance, both the range of block dimensions, as well as the orientation of the 
block can be limited (Fig. 3.6(d)). Typically, blocks are also allowed to be reflected 
about the x- or y-axis. 
The discrete dimensions of the block, sometimes called outer nodes (with respect to 
the feasible region of (h,w) pairs), can be thought of as non-dominated corner points 
that limit the shape function (Fig. 3.7). One academic floorplanner, DeFer, extends 
this calculus of shape functions to simultaneously optimize the shapes and locations 
of blocks in slicing floorplans [3.16]. 




Fig. 3.7 A 2 × 5 rotatable library block that has corner points w = 2 and w = 5 
in its shape function. 



Minimum-area floorplan. This algorithm finds the minimum floorplan area for a 
given slicing floorplan in polynomial time. For non-slicing floorplans, the problem 
is NP-hard [3.3]. 

Floorplan sizing consists of three major steps. 



1. Construct the shape functions of all individual blocks. 
2. Determine the shape function of the top-level floorplan from the shape 
   functions of the individual blocks using a bottom-up strategy. That is, start with 
   the lowest-level blocks and perform horizontal and vertical composition until 
   the optimal size and shape of the top-level floorplan are determined. 
3. From the corner point that corresponds to the minimum top-level floorplan 
   area, trace in a top-down fashion back to each block's shape function to find 
   that block's dimensions and location. 



Step 1: Construct the shape functions of the blocks 

Since the top-level floorplan's shape function depends upon the shape functions of 

individual blocks, the shape functions of all blocks must be identified first (Fig. 3.8). 



Fig. 3.8 Shape functions of two library blocks a (5 × 3) and b (4 × 2). The shape 
functions ha(w) and hb(w) show the feasible height-width combinations of 
the blocks. 



Step 2: Determine the shape function of the top-level floorplan 
The shape function of the top-level floorplan is derived from the individual blocks' 
shape functions. Two different kinds of composition - vertical and horizontal - can 
yield different results. 
In Fig. 3.9, block a is vertically aligned with block b. Let the shape function of block 
a be ha(w), and let the shape function of block b be hb(w). Then, the shape function 
of the top-level floorplan F is 

$$h_F(w) = h_a(w) + h_b(w)$$

The height of F is determined by hF, adding ha(w) and hb(w) for every corner point. 
The width of F is found as wF = max(wa,wb). 



Corner-point heights in Fig. 3.9 (hb + ha = hF): 4 + 5 = 9 (w = 3), 2 + 5 = 7 (w = 4), 2 + 3 = 5 (w = 5). 



Fig. 3.9 Vertical composition of two library blocks a (5 × 3) and b (4 × 2) superimposed for vertical 
composition. (a) To find the smallest bounding box for the top-level floorplan hF(w) from the 
functions ha(w) and hb(w), the block heights of the respective corner points are added. In this 
example, w = 3 and w = 5 (from ha(w)) are combined with w = 4 (from hb(w)). Note that w = 2 
(from hb(w)) is ignored, since the width of the top-level floorplan cannot be smaller than 3, which is 
the width of block a. (b) The potential floorplans of F are the corner points of hF(w). 

In Fig. 3.10, block a is horizontally aligned with block b. The width of F is 
determined by wF(h), adding wa(h) and wb(h) for every corner point. The height of F 
is hF = max(ha,hb). 







Fig. 3.10 Shape functions of two library blocks a (5 × 3) and b (4 × 2) superimposed for horizontal 
composition. (a) Finding hF(w). (b) The potential floorplans of F are the corner points of hF(w). 
Step 3: Find the floorplan and individual blocks' dimensions and locations 
Once the shape function of the top-level floorplan has been determined, the 
minimum floorplan area is computed. All minimum-area floorplans are always on 
corner points of the shape function (Fig. 3.11). After finding the corner point for the 
minimum area, each individual block's dimensions and relative location are found 
by backtracing from the floorplan's shape function to the block's shape function. 
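
For blocks with discrete corner points, Steps 2 and 3 can be sketched as below. Each shape function is assumed to be a list of (width, height) corner points, composition is evaluated only at corner widths (heights), and the helper names are hypothetical. The two blocks used here, a with corner points (1,3) and (3,1) and b with (2,2) and (4,1), are the ones from the floorplan sizing example in this section.

```python
def prune(points):
    """Keep only non-dominated (w, h) corner points, sorted by width."""
    kept = []
    for w, h in sorted(points):
        if not kept or h < kept[-1][1]:
            kept.append((w, h))
    return kept

def height_at(corners, w):
    """Smallest height the block can take when at most w wide, or None."""
    feasible = [h for cw, h in corners if cw <= w]
    return min(feasible) if feasible else None

def compose_vertical(a, b):
    """Stack a and b: widths align, heights add."""
    result = []
    for w in sorted({cw for cw, _ in a + b}):
        ha, hb = height_at(a, w), height_at(b, w)
        if ha is not None and hb is not None:
            result.append((w, ha + hb))
    return prune(result)

def compose_horizontal(a, b):
    """Place a and b side by side: heights align, widths add."""
    flip = lambda pts: [(h, w) for w, h in pts]
    return prune(flip(compose_vertical(flip(a), flip(b))))

def min_area(corners):
    """Corner point with the smallest bounding-box area."""
    return min(corners, key=lambda p: p[0] * p[1])

a = [(1, 3), (3, 1)]
b = [(2, 2), (4, 1)]
print(compose_vertical(a, b), min_area(compose_vertical(a, b)))      # best: (4, 2), area 8
print(compose_horizontal(a, b), min_area(compose_horizontal(a, b)))  # best: (7, 1), area 7
```

Backtracing then amounts to remembering which corner point of each child produced the chosen corner point of the parent; that bookkeeping is omitted here.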




Fig. 3.11 Finding the shapes of the individual blocks from the same top-level floorplan example 
(Fig. 3.9 and Fig. 3.10) with library blocks a (5 × 3) and b (2 × 4). The minimum-area floorplan F 
has wF = 5 and hF = 5, i.e., area 5 × 5 = 25 (left). The dimensions and locations of the individual 
blocks a and b (right) are derived by tracing back to the blocks' respective shape functions. 



Example: Floorplan Sizing 
Given: two blocks a and b. 
a: wa = 1, ha = 3 or wa = 3, ha = 1 
b: wb = 2, hb = 2 or wb = 4, hb = 1 

Task: find the minimum-area floorplan F using both horizontal and vertical composition, and 
its corresponding slicing tree. 



Solution: 

Construct the shape functions ha(w) and hb(w) of blocks a (left) and b (right). 

Vertical composition: determine the shape function hF(w) of F and the minimum-area corner 
point. The minimum area of F is 8, with dimensions wF = 4 and hF = 2. 

Corner points (ha + hb = hF): 3 + 2 = 5 (w = 2), 1 + 2 = 3 (w = 3), 1 + 1 = 2 (w = 4). 
Find the dimensions and locations of blocks a and b. 
Minimum area = 4 × 2 = 8, with block a sized 3 × 1 and block b sized 4 × 1. 



Horizontal composition: determine the shape function hF(w) of F and the minimum-area 
corner point. The minimum area of F is 7, with dimensions wF = 7 and hF = 1. 

Corner points (wa + wb = wF): 1 + 2 = 3, 2 + 3 = 5, 3 + 4 = 7, giving the candidate 
floorplans 3 × 3, 5 × 2, and 7 × 1. 



Find the dimensions and locations of blocks a and b. 
Minimum area = 7 × 1 = 7, with wa = 3 and wb = 4. 

Slicing tree of the floorplan with minimum area = min(8,7) = 7: a single vertical cut (V) 
with leaves a (3 × 1) and b (4 × 1). 



3.5.2 Cluster Growth 



In a cluster growth-based method, the floorplan is constructed by iteratively adding 
blocks until all blocks have been assigned (Fig. 3.12). An initial block is chosen and 
placed in the lower-left (or any other) corner. Subsequent blocks are added, one at a 
time, and merged either horizontally, vertically, or diagonally with the cluster. The 
location and orientation of the next block depend on the current shape of the cluster; 
it will be placed to best accommodate the objective function of the floorplan. In 
contrast to the floorplan-sizing algorithm, only the different orientations of the 
individual blocks are taken into account. Methods that directly and simultaneously 
optimize both the shapes of each block and the floorplan are not currently known. 
The order of the blocks is typically chosen by a linear-ordering algorithm. 
Fig. 3.12 Floorplan construction with a cluster-growth algorithm based on minimum floorplan area. 
Block a is placed first. Blocks b and c are placed so that the increase to the floorplan's dimensions 
is minimum. 

Linear ordering. Linear-ordering algorithms are often invoked to produce initial 
placement solutions for iterative-improvement placement algorithms. The objective 
of linear ordering is to arrange given blocks in a single row so as to minimize the 
total wirelength of connections between blocks. 

Kang [3.4] classified incident nets of a given block into the following three 
categories with respect to a partially constructed, left-to-right ordering (Fig. 3.13). 

— Terminating nets have no other incident blocks that are unplaced. 

— New nets have no pins on any block from the partially-constructed ordering. 

— Continuing nets have at least one pin on a block from the partially-constructed 
ordering and at least one pin on an unordered block. 






Fig. 3.13 Classification of nets based on the partially-constructed linear ordering. 



The ordering of a particular block m directly depends on the types of nets attached to 
m. Specifically, the block that completes the greatest number of "unfinished" nets 
should be placed first. In other words, the block with the highest difference between 
the number of terminating and new nets is chosen as the next block in the sequence. 

The linear-ordering algorithm starts by choosing an initial block. This can be done 
either arbitrarily or based on the number of connections to the other blocks (line 1). 
During every iteration, the gain gain for each block m ∈ M is calculated (lines 5-8), 
where gain is the difference between the number of terminating nets and new nets of 
m. The blocks with the maximum gain are selected (line 9). In case there are 
multiple blocks that have the maximum gain value, the blocks with the most 
terminating nets are selected (lines 10-11). If there are still multiple blocks, the 
blocks with the most continuing nets are selected (lines 12-13). If there are still 
multiple blocks, the blocks with the lowest connectivity are selected (lines 14-15). If 
there are still multiple blocks, a block is selected arbitrarily (lines 16-17). The 
selected block is then added to order and removed from M (lines 18-19). 



Linear Ordering Algorithm [3.4] 
Input: set of all blocks M 
Output: ordering of blocks order 

1.  seed = starting block 
2.  order = seed 
3.  REMOVE(M,seed)                                  // remove seed from M 
4.  while (|M| > 0)                                 // while M is not empty 
5.    foreach (block m ∈ M) 
6.      term_nets[m] = number of terminating nets incident to m 
7.      new_nets[m] = number of new nets incident to m 
8.      gain[m] = term_nets[m] - new_nets[m] 
9.    M' = block(s) that have maximum gain[m] 
10.   if (|M'| > 1)                                 // multiple blocks 
11.     M' = block(s) with the most terminating nets 
12.   if (|M'| > 1)                                 // multiple blocks 
13.     M' = block(s) with the most continuing nets 
14.   if (|M'| > 1)                                 // multiple blocks 
15.     M' = block(s) with the fewest connected nets 
16.   if (|M'| > 1)                                 // multiple blocks 
17.     M' = arbitrary block in M' 
18.   ADD(order,M')                                 // add M' to order 
19.   REMOVE(M,M')                                  // remove M' from M 
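
A compact Python sketch of this procedure follows. It folds the tie-breaking rules into one sort key (gain, terminating nets, continuing nets, fewest incident nets) and resolves any remaining tie by input order, which is one way to make the "arbitrary" choice; the function names are assumptions. The netlist is the one from the example in this section.

```python
def classify(block, nets, placed):
    """Count (terminating, new, continuing) nets of an unplaced block."""
    term = new = cont = 0
    for net in nets:
        if block not in net:
            continue
        others = set(net) - {block}
        if others <= placed:            # every other pin is already ordered
            term += 1
        elif not (others & placed):     # no pin on any ordered block
            new += 1
        else:                           # pins on both ordered and unordered blocks
            cont += 1
    return term, new, cont

def linear_ordering(blocks, nets, seed):
    """Greedy linear ordering starting from the given seed block."""
    order = [seed]
    remaining = [m for m in blocks if m != seed]
    while remaining:
        placed = set(order)

        def priority(m):
            term, new, cont = classify(m, nets, placed)
            return (term - new, term, cont, -(term + new + cont))

        chosen = max(remaining, key=priority)
        order.append(chosen)
        remaining.remove(chosen)
    return order

nets = [("a", "b"), ("a", "d"), ("a", "c", "e"),
        ("b", "d"), ("c", "d", "e"), ("d", "e")]      # N1 ... N6
print(linear_ordering("abcde", nets, "a"))            # ['a', 'b', 'd', 'e', 'c']
```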



Example: Linear Ordering Algorithm 

Given: (1) netlist with five blocks a-e, 
(2) starting block a, and (3) six nets N1-N6. 

N1 = (a,b)    N2 = (a,d)      N3 = (a,c,e) 
N4 = (b,d)    N5 = (c,d,e)    N6 = (d,e) 

Task: find a linear ordering using the linear ordering algorithm. 

Solution: 



Iteration #   Block   New Nets    Terminating Nets   gain   Continuing Nets 
0             a*      N1,N2,N3    --                  -3    -- 
1             b*      N4          N1                   0    -- 
              c       N5          --                  -1    N3 
              d       N4,N5,N6    N2                  -2    -- 
              e       N5,N6       --                  -2    N3 
2             c       N5          --                  -1    N3 
              d*      N5,N6       N2,N4                0    -- 
              e       N5,N6       --                  -2    N3 
3             c       --          --                   0    N3,N5 
              e*      --          N6                   1    N3,N5 
4             c*      --          N3,N5                2    -- 

For each iteration, an asterisk (*) denotes the block with maximum gain. 

Iteration 0: set block a as the first block in the ordering. 
Iteration 1: block b has maximum gain. Set as the second block in the ordering. 
Iteration 2: block d has maximum gain. Set as the third block in the ordering. 
Iteration 3: block e has maximum gain. Set as the fourth block in the ordering. 
Iteration 4: set c as the fifth (last) block in the ordering. 

The linear ordering that heuristically 
minimizes total net cost is <a b d e c>. 
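
The ordering procedure can also be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name linear_ordering, the dictionary-based netlist, and the use of "first block in input order" as the arbitrary tie-break are all assumptions made for this sketch.

```python
def linear_ordering(blocks, nets, seed):
    """Greedy linear ordering driven by gain = terminating nets - new nets."""
    order = [seed]
    remaining = [m for m in blocks if m != seed]
    while remaining:
        placed = set(order)

        def key(m):
            incident = [net for net in nets.values() if m in net]
            term = new = cont = 0
            for net in incident:
                others = set(net) - {m}
                if others <= placed:          # all other blocks already ordered
                    term += 1
                elif not (others & placed):   # no other block ordered yet
                    new += 1
                else:                         # partly ordered, partly unplaced
                    cont += 1
            # Tie-breaks in the order of the pseudocode: most terminating nets,
            # most continuing nets, fewest connected nets.
            return (term - new, term, cont, -len(incident))

        best = max(remaining, key=key)        # first maximum = arbitrary choice
        order.append(best)
        remaining.remove(best)
    return order


# Netlist of the example: five blocks a-e and six nets N1-N6.
nets = {
    "N1": ("a", "b"), "N2": ("a", "d"), "N3": ("a", "c", "e"),
    "N4": ("b", "d"), "N5": ("c", "d", "e"), "N6": ("d", "e"),
}
order = linear_ordering("abcde", nets, "a")
print(order)
```

Comparing tuples with max applies the tie-breaking rules of lines 10-17 in sequence without nested conditionals.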






Cluster growth. In the cluster-growth algorithm, the blocks are ordered using the 
linear ordering algorithm (line 2). For each block curr_block (line 4), the algorithm 
finds a location such that the floorplan grows evenly - toward the upper and right 
sides - while satisfying criteria such as the shape constraint of the top-level 
floorplan (line 5). Other typical criteria include total wirelength or the amount of 
deadspace within the floorplan. This algorithm is similar to the Tetris algorithm 
used for cell legalization (Sec. 4.4). 



Cluster-Growth Algorithm 
Input: set of all blocks M, cost function C 
Output: optimized floorplan F based on C 

1. F = Ø 
2. order = LINEAR_ORDERING(M)                 // generate linear ordering 
3. for (i = 1 to |order|) 
4.   curr_block = order[i] 
5.   ADD_TO_FLOORPLAN(F,curr_block,C)         // find location and orientation 
                                              // of curr_block that causes 
                                              // smallest increase based on 
                                              // C while obeying constraints 



The example below illustrates the cluster-growth algorithm. It uses the same linear 
ordering found in the previous example. The objective is to find the smallest 
floorplan, i.e., to minimize the area of the top-level floorplan. Though it produces 
mediocre solutions, the cluster-growth algorithm is fast and easy to implement; 
therefore, it is often used to find initial floorplan solutions for iterative algorithms 
such as simulated annealing. 



Example: Floorplan Construction by Cluster Growth 

Given: (1) blocks a-e and (2) linear ordering <a b d e c>. 

a: w_a = 2, h_a = 3  or  w_a = 3, h_a = 2 
b: w_b = 2, h_b = 1  or  w_b = 1, h_b = 2 
c: w_c = 2, h_c = 4  or  w_c = 4, h_c = 2 
d: w_d = 3, h_d = 2 
e: w_e = 6, h_e = 1 

Growth direction: toward the upper and right sides. 

Task: find a floorplan with minimum global bounding box area. 

Solution: (multiple solutions possible; one is illustrated below) 

Block a: place in lower-left corner with w_a = 2 and h_a = 3. 
Current bounding box area = 2 x 3 = 6. 

Block b: place above block a with w_b = 2 and h_b = 1. 
Current bounding box area = 2 x 4 = 8. 

Block d: place to the right of block a with w_d = 3 and h_d = 2. 
Current bounding box area = 5 x 4 = 20. 

Block e: place above block b with w_e = 6 and h_e = 1. 
Current bounding box area = 6 x 5 = 30. 

Block c: place to the right of block d with w_c = 2 and h_c = 4. 
Current bounding box area = 7 x 5 = 35. 
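
A rough Python sketch of cluster growth is given here under several simplifying assumptions that are not part of the original algorithm description: candidate positions are restricted to coordinates where an already placed block ends, the cost function C is the bounding-box area alone, and ties are broken toward the lower-left corner. The names overlaps, bbox_area and cluster_growth are hypothetical. Because this cost ignores the "even growth" criterion, the floorplan it produces need not match the one illustrated in the example.

```python
from itertools import product


def overlaps(a, b):
    """True if rectangles a and b, each (x, y, w, h), share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def bbox_area(rects):
    """Area of the bounding box anchored at the origin."""
    return max(x + w for x, y, w, h in rects) * max(y + h for x, y, w, h in rects)


def cluster_growth(order, shapes):
    """Place blocks one by one, each time minimizing the bounding-box area."""
    placed = {}
    for name in order:
        # Candidate coordinates: the origin plus every right/top block edge.
        xs = {0} | {x + w for x, y, w, h in placed.values()}
        ys = {0} | {y + h for x, y, w, h in placed.values()}
        best = None
        for (w, h), x, y in product(shapes[name], sorted(xs), sorted(ys)):
            cand = (x, y, w, h)
            if any(overlaps(cand, r) for r in placed.values()):
                continue
            key = (bbox_area(list(placed.values()) + [cand]), x + y)
            if best is None or key < best[0]:
                best = (key, cand)
        placed[name] = best[1]
    return placed


# Blocks and orientations of the example, using the dimensions as listed there.
shapes = {
    "a": [(2, 3), (3, 2)], "b": [(2, 1), (1, 2)], "c": [(2, 4), (4, 2)],
    "d": [(3, 2)], "e": [(6, 1)],
}
floorplan = cluster_growth(["a", "b", "d", "e", "c"], shapes)
for name, rect in floorplan.items():
    print(name, rect)
print("bounding box area:", bbox_area(list(floorplan.values())))
```

Adding further terms to the key tuple, such as an aspect-ratio penalty or a wirelength estimate, is one way to steer the growth toward the upper-right direction described above.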



3.5.3 Simulated Annealing 

Simulated annealing (SA) algorithms are iterative in nature - they begin with an 
initial (arbitrary) solution and seek to incrementally improve the objective function. 
During each iteration, a local neighborhood of the current solution is considered. A 
new candidate solution is formed by a small perturbation to the current solution 
within its neighborhood. 

Unlike greedy algorithms, SA algorithms do not always reject candidate solutions 
with higher cost. In a greedy algorithm, the new solution is accepted and replaces the 
current solution only if it is better, e.g., has lower cost (assuming a minimization 
objective), than the current solution. If no better solution exists in the local 
neighborhood of the current solution, the algorithm has reached a local minimum. 
The key drawback of greedy algorithms is that only beneficial changes to the current 
solution are accepted. 

Fig. 3.14 illustrates how the greedy approach breaks down. Starting from initial 
solution I, the algorithm goes downhill and eventually reaches solution state L. 
However, this is only a local minimum, not the global minimum G. A greedy 
iterative improvement algorithm, unless given a fortuitous initial state, will be 
unable to reach G. On the other hand, iterative approaches that accept inferior 
(non-improving) solutions can hill-climb away from L and potentially reach G. The 
simulated annealing algorithm is one of the most successful optimization strategies 
to integrate hill-climbing with iterative improvement. 

Fig. 3.14 For a minimization problem, an initial solution I, along with local optimum 
(L) and global optimum (G) solutions, plotted as cost over solution states. In the 
neighborhood structure shown, the cost function has several local minima, as is the 
case with most (intractable) layout problems. 

Principle of simulated annealing. In materials science, annealing refers to the 
controlled cooling of high-temperature materials to modify their properties. The goal 
of annealing is to alter the atomic structure of the material and reach a 
minimum-energy configuration. For instance, the atoms of a high-temperature metal 
are in high-energy, disordered states (chaos), while the atoms of a low-temperature 
metal are in low-energy, ordered states (crystalline structures). However, the same 
metal may experience atomic configurations that are brittle or mechanically hard, 
depending on the size of the individual crystals. When the high-temperature metal is 
cooled, it goes from a highly randomized state to a more structured state. The rate at 
which the metal is cooled will drastically affect the final structure. Moreover, the 
way in which the atoms settle is probabilistic in nature. In practice, a slower cooling 
process implies a greater chance that the atoms will settle as a perfect lattice, 
forming a minimum-energy configuration. 

The cooling process occurs in steps. At each step of the cooling schedule, the 
temperature is held constant for a specified amount of time. This allows the atoms to 
gradually cool down and stabilize, i.e., reach a thermodynamic equilibrium, at each 
given temperature. Though the atoms have the ability to move across large distances 
and create new higher-energy states, the probability of such drastic changes in this 
configuration decreases with temperature. 

As the cooling process continues, the atoms eventually will settle in a local, and 
possibly global, minimum-energy configuration. Both the rate and step size of the 
temperature decrease will affect how the atoms will settle. If the rate is sufficiently 
slow and the increment is sufficiently small, the atoms will settle, with high 
probability, at a global minimum. On the other hand, if cooling is too fast or the 
increment is too large, then the atoms are less likely to attain the global 
minimum-energy configuration, and will instead settle in a local minimum. 

Annealing-based optimization. The principle of annealing can be applied to solve 
combinatorial optimization problems. In the context of minimization, finding the 
lowest-cost solution in an optimization problem is analogous to finding a 
minimum-energy state of a material. Thus, simulated annealing algorithms take a 
"chaotic" (higher-cost) solution and emulate physical annealing to produce a 
"structured" (lower-cost) solution. 

The simulated annealing algorithm generates an initial solution and evaluates its cost. 
At each step, the algorithm generates a new solution by performing a random walk 
in the solution space, i.e., by applying a small perturbation (change in structure). This 
new solution is then accepted or rejected based on a temperature parameter T. When T 
is high (low), the algorithm has a higher (lower) chance of accepting a solution with 
higher cost. Analogous to physical annealing, the algorithm slowly decreases T, 
which correspondingly decreases the probability of accepting an inferior, higher-cost 
solution. One method for probabilistically accepting moves is based on the 
Boltzmann acceptance criterion, where the new solution is accepted if 

$$ e^{\frac{cost(curr\_sol) - cost(next\_sol)}{T}} > r $$

Here, curr_sol is the current solution, next_sol is the new solution after a 
perturbation, T is the current temperature, and r is a random number in [0,1) 
drawn from a uniform distribution. For a minimization problem, the final solution will 
be in a valley; for a maximization problem, it will be at a peak. 
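
To make the effect of T concrete, the following short Python sketch evaluates the acceptance probability implied by the Boltzmann criterion for a fixed cost increase at several temperatures. The function name acceptance_probability and the sample values are illustrative assumptions.

```python
import math


def acceptance_probability(delta_cost, temperature):
    """Chance that a move with cost change delta_cost passes e^(-delta/T) > r."""
    if delta_cost < 0:
        return 1.0                      # improving moves always pass the test
    return math.exp(-delta_cost / temperature)


# The same cost increase of 2 is accepted less and less often as T drops.
for t in (100.0, 10.0, 1.0, 0.1):
    print(f"T = {t:6.1f}  P(accept) = {acceptance_probability(2.0, t):.4f}")
```

For this cost increase, the probability falls from roughly 0.98 at T = 100 to about 0.14 at T = 1 and is practically zero at T = 0.1.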

The rate of temperature decrease is extremely important - it (1) must enable 
sufficient high-temperature exploration of the solution space at the beginning, while 
(2) allowing enough time at low temperatures to have sufficient probability of 
settling at a near-optimal solution. Just as slow cooling of high-temperature metal 
has a high probability of finding a globally optimal, energy-minimal crystal lattice, a 
simulated annealing algorithm with a sufficiently slow cooling schedule has high 
probability of finding a high-quality solution for a given optimization problem [3.5]. 

The simulated annealing algorithm is stochastic by nature - two runs usually yield 
two different results. The difference in quality stems from probabilistic decisions 
such as generation of new, perturbed solutions (e.g., by a cell swap), and the 
acceptance or rejection of moves. 

Algorithm. The algorithm begins with an initial (arbitrary) solution curr_sol (lines 
3-4), and generates a new solution by perturbing curr_sol (lines 8-9). The resulting 
new cost trial_cost is computed (line 10), and compared with the current cost 
curr_cost (line 11). If the new cost is better, i.e., $\Delta cost < 0$, the change of solution is 
accepted (lines 12-14). Otherwise, the change may still be probabilistically accepted. 
A random number $0 \le r < 1$ is generated (line 16), and if r is smaller than 
$e^{-\Delta cost / T}$, the change is accepted (lines 17-19). Otherwise, the change is rejected. 

To perform maximization, reverse the order of operands for the subtraction in line 
11. In the context of floorplanning, the function TRY_MOVE (line 9) can be 
replaced with operations such as moving a single block or swapping two blocks. 



Simulated Annealing Algorithm 
Input: initial solution init_sol 
Output: optimized new solution curr_sol 

1.  T = T0                                    // initialization 
2.  i = 0 
3.  curr_sol = init_sol 
4.  curr_cost = COST(curr_sol) 
5.  while (T > Tmin) 
6.    while (stopping criterion is not met) 
7.      i = i + 1 
8.      (ai,bi) = SELECT_PAIR(curr_sol)       // select two objects to perturb 
9.      trial_sol = TRY_MOVE(ai,bi)           // try small local change 
10.     trial_cost = COST(trial_sol) 
11.     Δcost = trial_cost - curr_cost 
12.     if (Δcost < 0)                        // if there is improvement, 
13.       curr_cost = trial_cost              // update the cost and 
14.       curr_sol = MOVE(ai,bi)              // execute the move 
15.     else 
16.       r = RANDOM(0,1)                     // random number [0,1) 
17.       if (r < e^(-Δcost/T))               // if it meets threshold, 
18.         curr_cost = trial_cost            // update the cost and 
19.         curr_sol = MOVE(ai,bi)            // execute the move 
20.   T = α · T                               // 0 < α < 1, T reduction 
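
The pseudocode translates almost line by line into Python. The sketch below is an illustration under stated assumptions: the inner stopping criterion is a fixed number of moves per temperature, the parameter defaults are arbitrary, and the toy problem (ordering five blocks so that the total span of the nets is small, perturbed by swapping two blocks) is a hypothetical stand-in for a real floorplanning cost function.

```python
import math
import random


def simulated_annealing(init_sol, cost, try_move, t0=10.0, t_min=0.01,
                        alpha=0.9, moves_per_temp=50, rng=None):
    """Generic SA loop; returns the final solution and its cost."""
    rng = rng or random.Random(0)
    curr_sol, curr_cost = init_sol, cost(init_sol)
    temp = t0
    while temp > t_min:
        for _ in range(moves_per_temp):       # assumed stopping criterion
            trial_sol = try_move(curr_sol, rng)
            delta = cost(trial_sol) - curr_cost
            # Accept improvements outright, inferior moves with prob. e^(-delta/T).
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                curr_sol, curr_cost = trial_sol, curr_cost + delta
        temp *= alpha                         # 0 < alpha < 1, T reduction
    return curr_sol, curr_cost


# Toy problem (hypothetical): arrange blocks a-e in a row, minimize net spans.
NETS = [("a", "b"), ("a", "d"), ("a", "c", "e"),
        ("b", "d"), ("c", "d", "e"), ("d", "e")]


def total_span(order):
    pos = {block: i for i, block in enumerate(order)}
    return sum(max(pos[b] for b in net) - min(pos[b] for b in net) for net in NETS)


def swap_two(order, rng):
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)


final_sol, final_cost = simulated_annealing(("e", "b", "c", "a", "d"),
                                            total_span, swap_two)
print(final_sol, final_cost)
```

As in the pseudocode, the function returns the current solution at the end of the schedule; many practical implementations additionally remember the best solution seen so far.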



The probability of accepting a change depends on both cost and temperature, and a larger 
(worse) cost difference implies a smaller chance that the change is accepted. At high 
temperatures, i.e., $T \to \infty$, $-\Delta cost / T \approx 0$ and the probability of accepting a 
change approaches $e^{0} = 1$. This leads to frequent acceptance of inferior solutions, 
which helps the algorithm escape "low-quality" regions of the solution space. On the 
other hand, at low temperatures, i.e., $T \to 0$, $-\Delta cost / T \to -\infty$ and the probability 
of accepting a change approaches $e^{-\infty} = 1/e^{\infty} = 0$. The initial temperature ($T_0$) 
and the degradation rate ($\alpha$) are usually set empirically. In general, better results are 
produced with a high initial temperature and a slow rate of cooling, at the cost of 
increased runtime. 

Simulated annealing-based floorplanning. The first simulated annealing algorithm 
for floorplanning was proposed in 1984 by R. Otten and L. van Ginneken [3.9]. 
Since then, simulated annealing has become one of the most common iterative 
methods used in floorplanning. 

In the direct approach, SA is applied directly to the physical layout, using the actual 
coordinates, sizes, and shapes of the blocks. However, finding a fully legal solution 
- a floorplan with no block overlaps - is difficult. Thus, intermediate solutions are 
allowed to have overlapping blocks, and a penalty function is incorporated to 
encourage legal solutions. The final produced solution, though, must be completely 
legal (see [3.10] for further reading). 

In the indirect approach, simulated annealing is applied to an abstraction of the 
physical layout. Abstract representations capture the floorplan using trees or 
constraint graphs. A final mapping is also required to generate the floorplan from the 
abstract representation. One advantage of this process over the direct approach is 
that all intermediate solutions are overlap-free. 

For further reading on simulated annealing-based floorplanning, see [3.1], [3.3], 
[3.14] and [3.15]. 

3.5.4 Integrated Floorplanning Algorithms 

Analytic techniques map the floorplanning problem to a set of equations where the 
variables represent block locations. These equations describe boundary conditions, 
attempt to prevent block overlap, and capture other relations between blocks. In 
addition, an objective function quantifies the important parameters of the floorplan. 

One well-known analytic method is mixed integer-linear programming (MILP), 
where the location variables are integers. This technique does not allow for overlaps 
and seeks globally optimal solutions. However, it is limited due to its computational 
complexity. For a problem size of 100 blocks, the integer program can have over 
10,000 variables and over 20,000 equations. Thus, MILP is usable only for small (10 
or fewer blocks) instances. 

A faster alternative that offers some compromises is to use a linear programming 
(LP) relaxation. Compared to MILP, the LP formulation does not restrict the locations 
to integers. As a result, LP can be used for larger problem instances. 

For further discussion of floorplanning with analytic methods, see [3.1]. A technique 
for floorplan repair (legalization) is described in [3.7]. 



3.6 Pin Assignment 



Given the large geometric sizes of blocks during floorplanning, the terminal 
locations of nets connecting these blocks are very important. I/O pins (net terminals) 
and their locations are usually on the periphery of a block to reduce interconnect 
length. However, the best locations depend on the relative placement of the blocks. 

Problem formulation. During pin assignment, all nets (signals) are assigned to 
unique pin locations such that the overall design performance is optimized. 
Common optimization goals include maximizing routability and minimizing 
electrical parasitics both inside and outside of the block. 




Fig. 3.15 The pin assignment process. Here, each of the 90 I/O pins from the chip is assigned a 
specific I/O pin on a printed circuit board. Each pin pair is then connected by a route. 



The goal of external pin assignment is to connect each incoming or outgoing signal 
to a unique I/O pin. Once the necessary nets have each been assigned a unique pin, 
they must be connected such that wirelength and electrical parasitics, e.g., coupling 
or reduced signal integrity, are minimized. For instance, Fig. 3.15 shows 90 pins on 
the microprocessor chip, each of which must be connected to an I/O pad at the next 
hierarchy level. After pin assignment, each pin from the chip has a connection to a 
unique pin on the external device, connected with short routes. 



Fig. 3.16 Functionally-equivalent input pins and electrically-equivalent output pins for a simplified 
example of an nMOS NAND gate. 

Alternatively, pin assignment can be used to connect cell pins that are functionally 
or electrically equivalent, such as during standard-cell placement (Chap. 4). Two 
pins are functionally equivalent if swapping them does not affect the design's logic 
and electrically equivalent (equipotential) if they are connected (Fig. 3.16). 

The main objective of internal pin assignment for cells is to reduce congestion and 
interconnect length between the cells (Fig. 3.17). Pin assignment techniques 
discussed below apply to both chip planning and placement (Chap. 4). 

Fig. 3.17 Pin assignment using the example given in Fig. 3.16. The assignment aims to minimize 
the total connection length by exploiting functionally and electrically equivalent pins. 

Pin assignment using concentric circles. The objective of this algorithm is to 
establish connections between a block and all its related pins in other blocks such 
that net crossings are minimized. The following simple algorithm [3.6], introduced 
in 1972, assumes that all outer pins (pins outside of the current block) have fixed 
locations. All inner pins (pins in the current block) will be assigned locations based 
on the locations of the electrically-equivalent outer pins. The algorithm uses two 
concentric circles - an inner circle for the pins of the block under consideration, and 
an outer circle for the pins in the other blocks. The goal is to assign legal pin 
locations to both circles such that there is no net overlap. 



Example: Pin Assignment Using Concentric Circles (Including Algorithm) 

Given: (1) set of pins on the block (black) and (2) set of pins on the chip (white) to 
which the block pins must connect. 

Task: perform pin assignment using concentric circles such that each pin on the block 
is connected to exactly one pin on the chip, and vice versa (i.e., a 1-to-1 mapping). 

Solution: 

Determine the circles. The two circles are drawn such that 
(1) all pins that belong to the block (black) are outside of the 
inner circle but within the outer circle and (2) all external 
pins (white) are outside of the outer circle. 

Determine the points. For each point, draw a line from that point to the center of the circles. 
Move each outer (white) point to its projection on the outer circle, and move each inner 
(black) point to its projection on the inner circle. 

Determine initial mapping. The initial mapping is an assignment from each outer pin to a 
corresponding inner pin. Choose a starting point and assign it arbitrarily. Then, assign 
the remaining points in clockwise or counter-clockwise direction. 

Optimize the mapping. Repeat the mapping process for other outer-inner point-pair 
combinations. That is, for the same starting point on the outer circle, assign a different point 
on the inner circle and map the remaining points. Do this until all point-pair combinations 
have been considered. The best mapping is the one with the shortest Euclidean distance. The 
pins of the block are then assigned according to the best mapping. 








For each remaining block, go to the first step. 
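
The steps of the example can be sketched in Python as follows. This is a simplified illustration under assumptions of its own: pins are 2D points, both circles share a known center, the two radii are supplied by the caller, and the block has as many pins as it has external partners. The function name assign_pins is hypothetical.

```python
import math


def assign_pins(center, inner_pins, outer_pins, r_inner, r_outer):
    """Return ({outer index: inner index}, total length) of the best mapping."""
    if len(inner_pins) != len(outer_pins):
        raise ValueError("a 1-to-1 mapping needs equally many pins")
    cx, cy = center

    def project(pins, radius):
        # Project each pin onto the circle and sort the pins by polar angle.
        out = []
        for idx, (x, y) in enumerate(pins):
            ang = math.atan2(y - cy, x - cx)
            point = (cx + radius * math.cos(ang), cy + radius * math.sin(ang))
            out.append((ang, idx, point))
        return sorted(out)

    inner = project(inner_pins, r_inner)
    outer = project(outer_pins, r_outer)
    n = len(inner)
    best = None
    for shift in range(n):                    # every choice of starting partner
        pairs = [(outer[k], inner[(k + shift) % n]) for k in range(n)]
        length = sum(math.dist(o[2], i[2]) for o, i in pairs)
        if best is None or length < best[1]:
            best = ({o[1]: i[1] for o, i in pairs}, length)
    return best


# Four block pins and four external pins rotated by 10 degrees around the center.
inner_pins = [(1.5, 0), (0, 1.5), (-1.5, 0), (0, -1.5)]
outer_pins = [(5 * math.cos(math.radians(a)), 5 * math.sin(math.radians(a)))
              for a in (10, 100, 190, 280)]
mapping, length = assign_pins((0, 0), inner_pins, outer_pins, 1.0, 4.0)
print(mapping, round(length, 3))
```

Since both pin sets are traversed in the same angular direction and only the starting partner changes, every candidate mapping preserves the cyclic order of the pins, which is what keeps the connections free of crossings.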



Topological pin assignment. In 1984, H. N. Brady improved the concentric-circle 
pin assignment algorithm by taking into account external block positions and 
multi-pin nets (connected to more than two pins) [3.2]. Specifically, this enabled pin 
assignment when external pins are behind other blocks or obstacles. 

Fig. 3.18(a) shows an example assigning pins from the main component m to an 
external block b. A midpoint line $l_{m \sim b}$ is drawn from the center of m through the 
midpoint of b. In Fig. 3.18(b), the pins of b are "unwrapped" and expanded as a line 
l' at the dividing point d - the farther point on $l_{m \sim b}$ that intersects b. The pins are 
then projected onto the outer circle (from the original concentric-circle algorithm). 

Fig. 3.18 Pin assignment from the main component m to an external block b. (a) The main 
component m is collapsed as a single point. A midpoint line $l_{m \sim b}$ is drawn from m through the 
center of b. A dividing point d is formed at the farther point where $l_{m \sim b}$ intersects b. (b) From d, 
project b's pins onto l'. (c) The pins are then projected onto the outer circle of m [3.2]. 

This improvement also allows several blocks to be considered simultaneously. Let 
the set of all external blocks be denoted as B, and let the main component be 
denoted as m. A midpoint line $l_{m \sim b}$ is drawn from m through the midpoint of every 
external block b ∈ B. 

A dividing point on block b is formed at the farther point where $l_{m \sim b}$ intersects b, and 
at the closer point where each midpoint line $l_{m \sim b'}$ intersects b, where b' ∈ B is located 
between m and b. Based on these dividing points, the pins are separated and 
extended accordingly (Fig. 3.19). 

Fig. 3.19 Pin assignment on blocks a and b [3.2]. On block a, consider the midpoint lines 
$l_{m \sim a}$ and $l_{m \sim b}$. The dividing point d1 is formed because it is the farther point on $l_{m \sim a}$; the 
dividing point d2 is formed because it is the closer point on $l_{m \sim b}$. On block b, consider only 
$l_{m \sim b}$. The dividing point d3 is formed because it is the farther point on $l_{m \sim b}$. Using d1-d3, 
the pins are "unwrapped" accordingly when projected onto m's outer circle. 



3.7 Power and Ground Routing 



On-chip supply voltages scale more slowly than chip frequencies and transistor 
counts. Therefore, currents supplied to the chip steadily increase with each 
technology generation. Improved packaging and cooling technologies, together with 
market demands for functionality, lead to ever-greater power budgets and denser 
power grids. Today, up to 20-40% of all metal resources on the chip are used to 
supply power (VDD) and ground (GND) nets. Since floorplanning precedes 
place-and-route, i.e., block- and chip-level implementation, power-ground planning 
has become an essential part of the modern chip planning process. 

Chip planning determines not only the layout of the power-ground distribution 
network, but also the placement of supply I/O pads (with wire-bond packaging) or 
bumps (with flip-chip packaging). The pads or bumps are preferentially located in or 
near high-activity regions of the chip to minimize the V = IR voltage drop.¹ In 
general, the power planning process is highly iterative, including (1) early 
simulation of major power dissipation components, (2) early quantification of chip 
power, (3) analyses of total chip power and maximum power density, (4) analyses of 
total chip power fluctuations, (5) analyses of inherent and added fluctuations due to 
clock gating, and (6) early power distribution analysis: average, maximum and 
multi-cycle fluctuations. 

To construct an appropriate supply network, many aspects of the design and the 
process technology must be considered. For example, to estimate chip power, the 
designer must plan for (1) use of low-$V_{th}$ devices and dynamic circuits that consume 
more power, (2) use of clock gating for low power, and (3) quantity and placement 
of added decoupling capacitors that mitigate switching noise. 



Fig. 3.20 Custom style of power-ground distribution for a chip floorplan. Power and 
ground rings are placed per block or group of abutted blocks; trunks connect the rings 
to each other or to the top-level power ring. 

¹ A supply I/O pad can deliver tens of milliamperes of current, while a supply bump can deliver 
hundreds of milliamperes, i.e., an order of magnitude more current. As an example of how much 
die resource goes to power-ground distribution - the Intel Pentium 4 microprocessor chip uses 
423 bumps, of which 223 are for delivery of VDD and GND. 

This section discusses the physical design of power-ground distribution networks. 
Fig. 3.20 illustrates conceptually how a floorplan in a custom design approach might 
associate supply rings with each block, for later connection to a chip-level power 
distribution plan such as those discussed below. 

3.7.1 Design of a Power-Ground Distribution Network 

The supply nets, VDD and GND, connect each cell in the design to a power source. 
As each cell must have both VDD and GND connections, the supply nets (1) are 
large, (2) span across the entire chip, and (3) are routed first before any signal 
routing. Core supply nets are distinguished from I/O supply nets, which are typically at a 
higher voltage. In many applications, one core power net and one core ground net 
are sufficient. Some ICs, such as mixed-signal or low-power (supply-gated or 
multiple voltage level) designs, can have multiple power and ground nets. 

Routing of supply nets is different from routing of signals. Power and ground nets 
should have dedicated metal layers to avoid consuming signal routing resources. In 
addition, supply nets prefer thick metal layers - typically, the top two layers in the 
back-end-of-line process - due to their low resistance. When the power-ground 
network traverses multiple layers, there must be sufficient vias to carry current while 
avoiding electromigration and other reliability issues. 

Since supply nets have high current loads, they are often much wider than standard 
signal routes. The widths of the individual wire segments may be tailored to 
accommodate their respective estimated branch currents. For logic gates to have 
correct timing performance, the net segment width must be chosen to keep the 
voltage drop, V = IR, within a specified tolerance, e.g., 5% of VDD. Wider segments 
have lower resistance, and hence lower voltage drop.² 
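
As a rough numerical illustration of this width selection, the sketch below computes the smallest width that keeps a single straight segment within the voltage-drop budget. It assumes the usual sheet-resistance model R = R_sheet · L / W and a constant current along the segment; the function name min_wire_width and all numeric values are hypothetical and do not come from any particular process technology.

```python
def min_wire_width(current_a, length_um, sheet_res_ohm_sq, vdd, tolerance=0.05):
    """Smallest width (um) so that V = I * R stays within tolerance * VDD.

    R = sheet_res_ohm_sq * length_um / width_um is an assumed wire model.
    """
    max_drop = tolerance * vdd
    return current_a * sheet_res_ohm_sq * length_um / max_drop


# Hypothetical segment: 100 mA over 1000 um, 0.02 ohm/square, VDD = 1.0 V.
width = min_wire_width(0.1, 1000.0, 0.02, 1.0)
resistance = 0.02 * 1000.0 / width
print(f"width = {width:.1f} um, R = {resistance:.2f} ohm, "
      f"drop = {0.1 * resistance * 1000:.0f} mV")
```

With these assumed values the segment needs a width of about 40 um, which gives a resistance of 0.5 ohm and a drop of 50 mV, i.e., exactly 5% of VDD.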

There are two approaches to the physical design of power-ground distribution - the 
planar approach, which is used primarily in analog or custom blocks, and the mesh 
approach, which is predominant in digital ICs. 

3.7.2 Planar Routing 

Power supply nets can be laid out using planar routing when (1) only two supply 
nets are present in the design, and (2) a cell needs a connection to both supply nets. 
Planar routing separates the two supply regions by a Hamiltonian path that connects 
all the cells, such that each supply net can be attached either to the left or right of 
every cell. The Hamiltonian path allows both supply nets to be routed across the 
layout - one to the left and one to the right of the path - with no conflict (Fig. 3.21). 



² Some design manuals will refer to an IR drop limit of 10% of VDD. This means that the supply 
can drop (droop) by 5% of VDD and the ground can bounce by 5% as well, resulting in a 
worst-case of 10% supply reduction. 

Fig. 3.21 Given a Hamiltonian path that connects all cells, each supply net has its own 
"uninterrupted" path to each cell, and hence both supply nets can be routed on one layer. 

Routing the power and ground nets in this planar fashion can be accomplished with 
the following three steps. 

Step 1: Planarize the topology of the nets 

As both power and ground nets must be routed on one layer, the design should be 
split using the Hamiltonian path. In Fig. 3.22(a), the two nets start from the left and 
right sides. Both nets grow in a tree-like fashion, without any conflict (overlap), and 
separated by the Hamiltonian path. The exact routes will depend on the pin locations. 
Finally, cells are connected wherever a pin is encountered. 

Step 2: Layer assignment 

Net segments are assigned to appropriate routing layers based on routability, the 
resistance and capacitance properties of each available layer, and design rule 
information. 



Step 3: Determining the widths of the net segments 

The width of each segment (branch) depends on the maximum current flow. That is, 
a segment's width is determined from the sum of the currents from all the cells to 
which it connects in accordance with Kirchhoff's Current Law (KCL). When dealing 
with large currents, designers often extend the "width" of the "planar" route in the 
vertical dimension with superposed segments on multiple layers that are stapled 
together with vias. In addition, the width determination is typically an iterative 
process, since currents depend on timing and noise, which depend on voltage drops, 
which, in turn, depend on currents. In other words, there is a cyclic dependency 
within the power, timing and noise analyses. This loop is typically addressed 
through multiple iterations and by experienced designers. After executing the above 
three steps, the segments of the power-ground route form obstacles during general 
signal routing (Fig. 3.22(b)). 

Fig. 3.22 (a) Generating the topology of the two supply nets. (b) Adjusting the widths of the 
individual segments in (a) with respect to their maximum current loads. 
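
Step 3 can be illustrated with a small Python sketch that applies Kirchhoff's Current Law to a tree-shaped supply net: the current in each segment is the sum of the currents drawn by all cells downstream of it, and the width follows from an allowed current per unit width. The tree, the cell currents, the limit of 2 mA per um and the function names are hypothetical values chosen only for illustration, and the iterative dependency on timing and noise described above is not modeled.

```python
def segment_currents(children, cell_current, root):
    """Current (mA) through the segment feeding each node of a supply tree."""
    currents = {}

    def visit(node):
        # KCL: inflow = own cell current + currents of all child branches.
        total = cell_current.get(node, 0.0) + sum(
            visit(child) for child in children.get(node, []))
        currents[node] = total
        return total

    visit(root)
    return currents


def segment_widths(currents, max_ma_per_um):
    """Width (um) of each segment for an assumed current limit per um of width."""
    return {node: i / max_ma_per_um for node, i in currents.items()}


# Hypothetical supply tree: pad -> t1 -> {c1, t2}, t2 -> {c2, c3}.
children = {"pad": ["t1"], "t1": ["c1", "t2"], "t2": ["c2", "c3"]}
cell_current = {"c1": 2, "c2": 3, "c3": 5}      # mA drawn by each cell
currents = segment_currents(children, cell_current, "pad")
widths = segment_widths(currents, max_ma_per_um=2.0)
for node in ("pad", "t1", "t2", "c1", "c2", "c3"):
    print(node, currents[node], "mA ->", widths[node], "um")
```

Segments close to the supply pad carry the accumulated current of the whole subtree and therefore come out widest, which matches the tapering shown in Fig. 3.22(b).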

3.7.3 Mesh Routing 



Power-ground routing in modern digital ICs typically has a mesh topology that is 
created through the following five steps. 

Step 1: creating a ring 

Typically, a ring is constructed to surround the entire core area of the chip, and 
possibly individual blocks. The purpose of the ring is to connect the supply I/O cells 
and, possibly, electrostatic discharge protection structures with the global power 
mesh of the chip or block. For low resistance, these connections and the ring itself 
are on many layers. For example, a ring might use metal layers Metal2- MetalS 
(every layer except Metal1). 

Step 2: connecting I/O pads to the ring 

The top left of Fig. 3.23 shows connectors from the I/O pads to the ring. Each I/O 
pad will have a number of "fingers" emanating on each of several metal layers. 
These should be maximally connected to the power ring in order to minimize 
resistance and maximize the ability to carry current to the core. 

Step 3: creating a mesh 

A power mesh consists of a set of stripes at defined pitches on two or more layers 
(Fig. 3.23). The width and pitch of stripes are determined from estimated power 
consumption as well as layout design rules. The stripes are laid out in pairs, 
alternating as VDD-GND, VDD-GND, and so on. The power mesh uses the 
uppermost and thickest layers, and is sparser on any lower layers to avoid signal 
routing congestion. Stripes on adjacent layers are typically connected with as many 
vias as possible, again to minimize resistance. 

Step 4: creating Metal1 rails 

The Metal1 layer is where the power-ground distribution network meets the logic 
gates of the design. The width (current supply capability) and pitch of the Metal1 
rails are typically determined by the standard-cell library. Standard-cell rows are laid 
out "back-to-back" so that each supply net is shared by two adjacent cell rows. 

Step 5: connecting the Metal1 rails to the mesh 

Finally, the Metal1 rails are connected to the mesh with stacked vias. A key 
consideration is the proper size of (number of vias in) the via stack. For example, the 
most resistive part of the power distribution should be the Metal1 segments between 
via stacks, rather than the stack itself. In addition, the via stack is optimized to 
maintain routability of the design. For instance, a 1 x 4 array of vias may be 
preferable to a 2 x 2 array, depending on the direction in which routing is congested. 

Figure 3.23 illustrates the mesh approach to power-ground distribution. In the figure, 
layers MetalS through MetaW are used for the mesh. In practice, many chips will use 
fewer layers (e.g., MetalS and Metal? only) due to routing resource constraints. 
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Fig. 3.23 Construction of a mesh power-ground distribution network. 
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Chapter 3 Exercises 



Exercise 1: Slicing Trees and Constraint Graphs 

For the given floorplan (right), generate its slicing tree, 
vertical constraint graph and horizontal constraint graph. 
Exercise 2: Floorplan-Sizing Algorithm 

Three blocks a, b and c are given along with their size options. 
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(a) Determine the shape functions for each block a, b, c. 

(b) Find the minimum area of the top-level floorplan using the given tree structure 
and determine the shape function of the top-level floorplan. In the shape function, 
find the corner point that yields the minimal area. Finally, determine the dimensions 
of each block and draw the resulting floorplan. 




Exercise 3: Linear-Ordering Algorithm 

For the netlist with five blocks a-e and six nets N1-N6, determine the linear ordering 
that minimizes total wirelength. Let the starting block be block a. Place it in the first 
(leftmost) position. Draw the resulting placement. 



A^i = (a,e) 
N^ = (a,c,d) 
Ns = (b,c,d) 
Exercise 4: Non-Slicing Floorplans 

Recall that the smallest non-slicing floorplans with no wasted space exhibit the 
structure of a clockwise or counter-clockwise wheel with five blocks. Draw a 
non-slicing floorplan with only four blocks a-d. 
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Global and Detailed Placement 



After partitioning the circuit into smaller modules and floorplanning the layout to 
determine block outlines and pin locations, placement seeks to determine the 
locations of standard cells or logic elements within each block while addressing 
optimization objectives, e.g., minimizing the total length of connections between 
elements. Specifically, global placement (Sec. 4.3) assigns general locations to 
movable objects, while detailed placement (Sec. 4.4) refines object locations to legal 
cell sites and enforces nonoverlapping constraints. The detailed locations enable 
more accurate estimates of circuit delay for the purpose of timing optimization. 



4.1 Introduction 






The objective of placement is to determine the locations and orientations of all 
circuit elements within a (planar) layout, given solution constraints (e.g., no 
overlapping cells) and optimization goals (e.g., minimizing total wirelength). 



Fig. 4.1 A simple circuit (top left) along with example linear placement (top right), 2D placement 
(bottom left), and placement and routing with standard cells (bottom right). 

Circuit elements (e.g., gates, standard cells and macro blocks) have rectangular 
shapes and are represented by nodes, while nets are represented by edges (Fig. 4.1). 
Some circuit elements may have fixed locations while others are movable 
(placeable). The placement of movable objects will determine the achievable quality 
of the subsequent routing stages. However, detailed routing information, such as 



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6_4, © Springer Science+Business Media B.V. 2011 
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track assignment, is not available at the placement stage, and hence the placer 
estimates the eventual Manhattan routing, where only horizontal and vertical wires 
are allowed (Sec. 4.2). In Fig. 4.1, assuming unit distance between horizontally- or 
vertically-adjacent placement sites, both the linear and 2D placements have 10 units 
of total wirelength. 

Placement techniques for large circuits encompass global placement, detailed 
placement and legalization. Global placement often neglects specific shapes and 
sizes of placeable objects and does not attempt to align their locations with valid grid 
rows and columns. Some overlaps are allowed between placed objects, as the 
emphasis is on judicious global positioning and overall density distribution. 
Legalization is performed before or during detailed placement. It seeks to align 
placeable objects with rows and columns, and remove overlap, while trying to 
minimize displacements from global placement locations as well as impacts on 
interconnect length and circuit delay. Detailed placement incrementally improves 
the location of each standard cell by local operations (e.g., swapping two objects) or 
shifting several objects in a row to create room for another object. Global and 
detailed placement typically have comparable runtimes, but global placement often 
requires much more memory and is more difficult to parallelize. 

Performance-driven optimizations can be applied during both global placement and 
detailed placement. However, timing estimation (Chap. 8) can be inaccurate during 
early stages of global placement. Thus, it is more common to initiate performance 
optimizations during later stages of, or after, global placement. Legalization is often 
performed so as to minimize impact on performance, and detailed placement can 
directly improve performance because of its fine control over individual locations. 






4.2 Optimization Objectives 



Placement must produce a layout wherein all nets of the design can be routed 
simultaneously, i.e., the placement must be routable. In addition, electrical effects 
such as signal delay or crosstalk must be taken into consideration. As detailed 
routing information is not available during placement, the placer optimizes estimates 
of routing quality metrics, such as total weighted wirelength, cut size, wire 
congestion (density), or maximum signal delay (Fig. 4.2). 
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Fig. 4.2 Examples of routing quality metrics optimized during placement. 
Subject to maintaining routability, the primary goal of placement is to optimize 
delays along signal paths. Since the delay of a net directly correlates with the net's 
length, placers often minimize total wirelength.¹ As illustrated in Fig. 4.3, the 
placement of a design strongly affects the net lengths as well as the wire density. 
Further placement optimizations, beyond the scope of this chapter, include 

— Placement and assignment of I/O pads with respect to both the logic gates 
connected to them and the package connection points, e.g., bump locations. 

— Temperature- and reliability-driven optimizations, such as the placement of 
actively switching (and heat-generating) circuit elements, to achieve uniform 
temperature across the chip. 
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Fig. 4.3 Two placements 
of the same design with 
better (left) and worse 
(right) total wirelength. 



Wirelength estimation for a given placement. An important consideration for 
wirelength estimation during placement is the speed at which a given wirelength 
estimate can be computed. Moreover, the estimation must be applicable to both 
two-pin and multi-pin nets. For two-pin nets, most placement tools use the 
Manhattan distance $d_M$ between two points (pins) $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$ 

$$d_M(P_1, P_2) = |x_2 - x_1| + |y_2 - y_1|$$ 

For multi-pin nets, the following estimation techniques are used [4.23]. 



The half-perimeter wirelength (HPWL) model is commonly 
used because it is reasonably accurate and efficiently 
calculated. The bounding box of a net with p pins is the 
smallest rectangle that encloses the pin locations. The 
wirelength is estimated as half the perimeter of the 
bounding box. For two- and three-pin nets (70-80% of all 
nets in most modern designs), this is exactly the same as the 
rectilinear Steiner minimum tree (RSMT) cost (discussed 
later in this section). When p ≥ 4, HPWL underestimates the 
RSMT cost by an average factor that grows asymptotically 
as $\sqrt{p}$. 

(In the accompanying figure, HPWL = 9.) 
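The two estimates above can be sketched in a few lines of Python. This is a minimal illustration, not a placer's actual implementation; the function names `manhattan` and `hpwl` and the pin coordinates are assumptions chosen for the example.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]

def manhattan(p1: Point, p2: Point) -> float:
    """Manhattan (rectilinear) distance between two pins."""
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

def hpwl(pins: Iterable[Point]) -> float:
    """Half-perimeter wirelength: width plus height of the net's bounding box."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

# Hypothetical four-pin net (coordinates are illustrative only).
net = [(0, 0), (2, 3), (5, 1), (3, 4)]
print(hpwl(net))  # bounding box is 5 wide and 4 tall
```

For a two-pin net the bounding box is spanned by the two pins, so `hpwl` and `manhattan` return the same value.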



¹ Timing-driven placement techniques are discussed in Sec. 8.3. 
The complete graph (clique) model of a p-pin net net has 

$$\binom{p}{2} = \frac{p!}{2!\,(p-2)!} = \frac{p(p-1)}{2}$$ 

edges, i.e., each pin is directly connected to every other pin. 
Since a spanning tree over the net's pins will have p - 1 
edges, a correction factor of 2/p is applied. The total edge 
length of net according to the clique model is 

$$L(net) = \frac{2}{p} \sum_{e \in clique} d_M(e)$$ 

where e is an edge in the clique, and $d_M(e)$ is the Manhattan 
distance between the endpoints of e. 

(In the accompanying figure, clique length = 14.5.) 
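As a small sketch of the clique model, the following Python function sums the Manhattan lengths of all p(p-1)/2 pin pairs and applies the 2/p correction factor. The function name and the sample coordinates are illustrative assumptions.

```python
from itertools import combinations

def clique_length(pins):
    """Clique-model wirelength: (2/p) times the sum of all pairwise Manhattan distances."""
    p = len(pins)
    if p < 2:
        return 0.0
    total = sum(abs(x1 - x2) + abs(y1 - y2)
                for (x1, y1), (x2, y2) in combinations(pins, 2))
    return 2.0 / p * total

# Hypothetical three-pin net: pairwise distances are 2, 4 and 2.
print(clique_length([(0, 0), (2, 0), (2, 2)]))
```

For p = 2 the correction factor is 1, so the estimate reduces to the Manhattan distance between the two pins.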



The monotone chain model connects the pins of a net using 
a chain topology. Both end pins have degree one; each 
intermediate pin has degree two. Finding a minimum-length 
path connecting the pin locations corresponds to the 
NP-hard Hamiltonian path problem. Thus, the monotone 
chain model sorts pins by either x- or y-coordinate and 
connects them accordingly. Though simple, this method 
often overestimates the actual wirelength. Another 
disadvantage is that the chain topology changes with the 
placement. 

The star model considers one pin as the source node and all 
other pins as sink nodes; there is an edge from the source to 
each sink. This is especially useful for timing optimization, 
since it captures the direction of signal flow from an output 
pin to one or more input pins. The star model uses only p - 
1 edges; this sparsity can be advantageous in modeling high 
pin-count nets. On the other hand, the star model 
overestimates wirelength. 



The rectilinear minimum spanning tree (RMST) model 
decomposes the p-pin net into two-pin connections and 
connects the p pins with p - 1 connections. Several 
algorithms (e.g., Kruskal's Algorithm [4.19]) exist for 
constructing minimum spanning trees. RMST algorithms 
can exploit the Manhattan geometry to achieve O(p log p) 
runtime complexity (Sec. 5.6.1). 
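For illustration, the RMST length of a small net can be computed with a plain Prim-style construction over Manhattan distances. This sketch runs in O(p^2) time and does not use the geometric speedups mentioned above; it substitutes Prim's algorithm for Kruskal's purely for brevity, and the function name is an assumption.

```python
def rmst_length(pins):
    """Length of a rectilinear minimum spanning tree (Prim's algorithm, O(p^2))."""
    if len(pins) < 2:
        return 0

    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # best[i] = cheapest known connection from pin i to the growing tree
    best = {i: dist(pins[0], pins[i]) for i in range(1, len(pins))}
    total = 0
    while best:
        nxt = min(best, key=best.get)      # closest pin not yet in the tree
        total += best.pop(nxt)
        for i in best:                     # relax distances through the new pin
            best[i] = min(best[i], dist(pins[nxt], pins[i]))
    return total

print(rmst_length([(0, 0), (4, 0), (4, 3)]))  # two edges of length 4 and 3
```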
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The rectilinear Steiner minimum tree (RSMT) model 
connects all p pins of the net and as many as p - 2 
additional Steiner (branching) points (Sec. 5.6.1). Finding 
an optimal set of Steiner points for an arbitrary point set is 
NP-hard. For nets with a bounded number of pins, 
computing an RSMT takes constant time. If the Steiner 
points are known, an RSMT can be found by constructing 
an RMST over the union of the original point set and the set 
of added Steiner points. 
The rectilinear Steiner arborescence (RSA) model of a 
p-pin net is also a tree where a single source node $s_0$ is 
connected to p - 1 sink nodes. In an RSA, the path length 
from $s_0$ to any sink $s_i$, $1 \le i \le p-1$, must be equal to the $s_0$~$s_i$ 
Manhattan (rectilinear) distance. That is, for all sinks $s_i$ 

$$L(s_0, s_i) = d_M(s_0, s_i)$$ 

where $L(s_0, s_i)$ is the path length from $s_0$ to $s_i$ in the tree. 
Computing a minimum-length RSA is NP-hard. 
The single-trunk Steiner tree (STST) model consists of one 
vertical (horizontal) segment, i.e., trunk, and connects all 
pins to this trunk using horizontal (vertical) segments, i.e., 
branches. STSTs are commonly used for estimation due to 
their ease of construction. RSAs are more timing-relevant 
but somewhat more complex to construct than STSTs. For 
practical purposes, both RSAs and STSTs are constructed in 
O(p log p)-time, where p is the number of pins. 
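A single-trunk Steiner tree is simple enough to evaluate directly. The sketch below assumes a horizontal trunk placed at the median y-coordinate of the pins; the text does not prescribe the trunk position, so this choice (which minimizes the total branch length) and the function name are assumptions.

```python
from statistics import median

def stst_length(pins):
    """Wirelength of a single-trunk Steiner tree with a horizontal trunk.

    Assumption: the trunk sits at the median y-coordinate of the pins and
    spans from the leftmost to the rightmost pin.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    trunk_y = median(ys)
    trunk = max(xs) - min(xs)                       # horizontal trunk
    branches = sum(abs(y - trunk_y) for y in ys)    # vertical branches
    return trunk + branches

print(stst_length([(0, 0), (2, 3), (5, 1)]))  # trunk 5, branches 1 + 2 + 0
```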
Total wirelength with net weights (weighted wirelength). Net weights can be used 
to prioritize certain nets over others. For example, a net net with weight w(net) = 2 is 
equivalent to two nets i and j with weights w(i) = w(j) = 1. For a placement P, an 
estimate of total weighted wirelength is 

$$L(P) = \sum_{net \in P} w(net) \cdot L(net)$$ 

where w(net) is the weight of net, and L(net) is the estimated wirelength of net. 
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Example: Total Weighted Wirelength of a Placement 

Given: (1) placement P of blocks a-f and their pins (right) and 

(2) nets N1-N3 and their net weights. 

N1 = (a1, b1, d2)    w(N1) = 2 

N2 = (c1, d1, f1)    w(N2) = 4 

N3 = (e1, f2)    w(N3) = 1 

Task: estimate the total weighted wirelength of P using the 
RMST model. 

Solution: 

$L(N_1) = d_M(a_1,b_1) + d_M(b_1,d_2) = 4 + 3 = 7$ 
$L(N_2) = d_M(c_1,d_1) + d_M(d_1,f_1) = 2 + 2 = 4$ 
$L(N_3) = d_M(e_1,f_2) = 3$ 

$L(P) = w(N_1) \cdot L(N_1) + w(N_2) \cdot L(N_2) + w(N_3) \cdot L(N_3)$ 
$= 2 \cdot 7 + 4 \cdot 4 + 1 \cdot 3 = 14 + 16 + 3 = 33$ 
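The arithmetic of this example can be checked with a short Python sketch. The per-net lengths 7, 4 and 3 are taken from the solution above; the dictionary layout and function name are assumptions made for illustration.

```python
def total_weighted_wirelength(nets):
    """L(P) = sum over nets of w(net) * L(net); nets maps name -> (weight, length)."""
    return sum(weight * length for weight, length in nets.values())

# (weight, RMST length) pairs from the worked example
nets = {"N1": (2, 7), "N2": (4, 4), "N3": (1, 3)}
print(total_weighted_wirelength(nets))  # 14 + 16 + 3
```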
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Maximum cut size. In Fig. 4.4, a (global) vertical cutline divides the region into a 
left region L and a right region R. With respect to the cutline, a net can be classified 
as either uncut or cut. An uncut net has pins in L or R, but not both, i.e., it is 
completely left or right of the cutline; a cut net has at least one pin in each L and R. 







Fig. 4.4 Layout with a vertical cutline and a cut net. 



Given a placement P, let 

— $V_P$ and $H_P$ be the set of global vertical and horizontal cutlines of P, respectively 

— $\Psi_P(cut)$ be the set of nets cut by a cutline cut 

— $\psi_P(cut)$ be the size of $\Psi_P(cut)$, i.e., $\psi_P(cut) = |\Psi_P(cut)|$ 

Then, define X(P) to be the maximum of $\psi_P(v)$ over all vertical cutlines $v \in V_P$. 

$$X(P) = \max_{v \in V_P} \psi_P(v)$$ 

Similarly, define Y(P) as the maximum of $\psi_P(h)$ over all horizontal cutlines $h \in H_P$. 

$$Y(P) = \max_{h \in H_P} \psi_P(h)$$ 



X(P) and Y(P) are lower bounds on routing capacity needed in the horizontal (x-) 
and vertical (y-) directions, respectively. For example, if X(P) = 10, then some 
global vertical cutline x crosses 10 horizontal net segments. Likewise, if Y(P) = 15, 
some global horizontal cutline y crosses 15 vertical net segments. A necessary but 
not sufficient condition for routability is that there exist at least 10 (15) horizontal 
(vertical) routing tracks at x (y). Thus, X(P) and Y(P) can be used to assess 
routability of P. 

For some circuit layout styles, e.g., gate arrays, the capacity (maximum number) of 
horizontal and vertical tracks is pre-set. An optimization constraint in placement, 
necessary but not sufficient for routability, is to ensure that X(P) and Y(P) are within 
these capacities. In the context of standard-cell design, X(P) gives a lower bound on 
the demand for horizontal routing tracks, and Y(P) gives a lower bound on the 
demand for vertical routing tracks. 
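The definitions of X(P) and Y(P) translate directly into code. The sketch below treats a net as cut when it has pins strictly on both sides of a cutline, as in the definition above; the function names, the representation of a net as a list of pin coordinates, and the sample data are all assumptions.

```python
def cut_count(nets, cut, axis):
    """Number of nets with pins on both sides of a cutline.

    axis = 0: vertical cutline at x = cut; axis = 1: horizontal cutline at y = cut.
    """
    return sum(1 for pins in nets
               if min(p[axis] for p in pins) < cut < max(p[axis] for p in pins))

def max_cut_sizes(nets, vertical_cuts, horizontal_cuts):
    """Return (X(P), Y(P)): maximum cut counts over vertical and horizontal cutlines."""
    x_p = max(cut_count(nets, v, 0) for v in vertical_cuts)
    y_p = max(cut_count(nets, h, 1) for h in horizontal_cuts)
    return x_p, y_p

# Hypothetical placement: three nets on a 3 x 3 grid of sites.
nets = [[(0, 0), (2, 0)], [(0, 1), (1, 1)], [(1, 0), (1, 2)]]
print(max_cut_sizes(nets, vertical_cuts=[0.5, 1.5], horizontal_cuts=[0.5, 1.5]))
```

Comparing the returned values against the available track counts gives the necessary (but not sufficient) routability check described above.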



To improve routability of P, X(P) and Y(P) must be minimized. To improve total 
wirelength of P, separately calculate the number of crossings of global vertical and 
horizontal cutlines, and minimize 

$$L(P) = \sum_{v \in V_P} \psi_P(v) + \sum_{h \in H_P} \psi_P(h)$$ 



Example: Cut Sizes of a Placement 
Given: (1) placement P of blocks a-f and their pins (right), 
(2) nets N1-N3, (3) global vertical cutlines v1 and v2, and 
(4) global horizontal cutlines h1 and h2. 

N1 = (a1, b1, d2)    N2 = (c1, d1, f1)    N3 = (e1, f2) 

Task: determine the cut sizes X(P) and Y(P) of placement 
P according to the RMST model. 

Solution: 

Find the cut values for each global cutline. 

$\psi_P(v_1) = 1$    $\psi_P(v_2) = 2$ 
$\psi_P(h_1) = 3$    $\psi_P(h_2) = 2$ 

Find the total number of crossings in P. 

$\psi_P(v_1) + \psi_P(v_2) + \psi_P(h_1) + \psi_P(h_2) = 1 + 2 + 3 + 2 = 8$ 

Find the cut sizes. 

$X(P) = \max(\psi_P(v_1), \psi_P(v_2)) = \max(1, 2) = 2$ 

$Y(P) = \max(\psi_P(h_1), \psi_P(h_2)) = \max(3, 2) = 3$ 

Observe that moving block b from (0,0) to (0,1) reduces the number of crossings from 8 to 6. 
This also decreases $\psi_P(h_1)$ from 3 to 1, which makes Y(P) = 2, thereby reducing the local 
congestion. 



Routing congestion. The routing congestion of a placement P can be thought of in 
terms of density, namely, the ratio of demand for routing tracks to the supply of 
available routing tracks. For example, a routing channel has available horizontal 
routing tracks, while a switchbox has available vertical and horizontal routing tracks 
(Fig. 4.5). For gate-array designs, this concept of congestion is particularly 
important, as the supply of routing tracks is fixed (Sec. 5.3). 
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Fig. 4.5 (a) A switchbox SB and a channel CH. (b) SB and CH represented by grid cells with wire 
capacities on the grid cell edges. 

For a given placement P, congestion can also be estimated by the number of nets 
that pass through the boundaries of individual routing regions. Formally, the local 
wire density $\varphi_P(e)$ of an edge e between two neighboring grid cells² is 

$$\varphi_P(e) = \frac{\eta_P(e)}{\sigma_P(e)}$$ 

where $\eta_P(e)$ is the estimated number of nets that cross e and $\sigma_P(e)$ is the maximum 
number of nets that can cross e. If $\varphi_P(e) > 1$, then too many nets are estimated to 
cross e, making P more likely to be unroutable. The wire density of P is 

$$\Phi(P) = \max_{e \in E} \varphi_P(e)$$ 

where E is the set of all edges. If $\Phi(P) \le 1$, then the design is estimated to be fully 
routable. If $\Phi(P) > 1$, then routing will need to detour some nets through 
less-congested edges, but in some cases this may be impossible. Therefore, 
congestion-driven placement often seeks to minimize $\Phi(P)$. 
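The wire density is a maximum over per-edge ratios, which the following Python sketch computes. The edge names, the dictionary representation, and the sample demand values are assumptions for illustration; a uniform capacity of 3 is used here only as an example.

```python
def wire_density(demand, capacity):
    """Phi(P): maximum over all edges e of eta_P(e) / sigma_P(e)."""
    return max(demand[e] / capacity[e] for e in demand)

# Hypothetical grid edges with estimated net crossings and a uniform capacity of 3.
demand = {"e1": 1, "e2": 2, "e3": 0}
capacity = {e: 3 for e in demand}

phi = wire_density(demand, capacity)
print(phi, "routable" if phi <= 1 else "congested")
```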



Example: Wire Density of a Placement 
Given: (1) placement P of blocks a-f and their pins 
(right), (2) nets N1-N3, (3) local vertical cutlines v1-v6, (4) 
local horizontal cutlines h1-h6, and (5) $\sigma_P(e) = 3$ for all 
local cutlines $e \in E$. 

N1 = (a1, b1, d2)    N2 = (c1, d1, f1)    N3 = (e1, f2) 

Task: find the wire density $\Phi(P)$ and determine the 
routability of P based on the RMST model. 




² Such an edge represents the border between two regions (e.g., switchboxes or channels) or 
between two cutlines. 






Solution: 

The following is one of many possible solutions, as $\Phi(P)$ depends on how N1-N3 are routed. 

Vertical Edges: 

Tlp(Vi) = 1 



Horizontal edges: 
ilp(/!,)=l 
Mhi) = 2 

T|p(/!4) = 1 
T]p(h,) = 1 
T\Ah6) = 

Maximum T|p(e) = 2. 



T]p(V2) = 
T\p(V3) = 
n/<V4) = 
TIpCVs) = 2 

TlKve) = 




$\Phi(P) = \max_{e \in E} \eta_P(e) / \sigma_P(e) = 2/3$. Since $\Phi(P) < 1$, P is estimated to be routable. 



Signal delays. The total wirelength of the placement affects the maximum clock 
frequency for a given design, as it depends on net (wire) delays and gate delays. In 
earlier process technologies, gate delays accounted for the majority of circuit delay. 
With modern processes, however, due to technology scaling, wire delays contribute 
a significant portion of overall path delay. 



Circuit timing is usually verified using static timing analysis (STA) (Sec. 8.2.1), 
based on estimated net and gate delays. Common terminology includes actual 
arrival time (AAT) and required arrival time (RAT), which can be estimated for 
every node v in the circuit. AAT(v) represents the latest transition time at a given 
node v measured from the beginning of the clock cycle. RAT(v) represents the time 
by which the latest transition at v must complete in order for the circuit to operate 
correctly within a given clock cycle. For correct operation of the chip with respect to 
setup (maximum path delay) constraints, it is required that AAT(v) ≤ RAT(v). 
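The setup condition can be checked node by node once AAT and RAT estimates are available. The sketch below is illustrative only: the function name, the dictionary representation, and the timing values are assumptions, and it does not perform static timing analysis itself.

```python
def setup_violations(aat, rat):
    """Return the nodes v for which AAT(v) > RAT(v), i.e., the setup constraint fails."""
    return sorted(v for v in aat if aat[v] > rat[v])

# Hypothetical arrival times for two nodes (arbitrary time units).
aat = {"u": 3.0, "v": 5.5}
rat = {"u": 4.0, "v": 5.0}
print(setup_violations(aat, rat))  # node v arrives later than required
```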
4.3 Global Placement 



Techniques for circuit placement are summarized as follows (Fig. 4.6). In 
partitioning-based algorithms, the netlist and the layout are divided into smaller 
sub-netlists and sub-regions, respectively, according to cut-based cost functions. 
This process is repeated until each sub-netlist and sub-region is small enough to be 
handled optimally. An example of this approach is min-cut placement (Sec. 4.3.1). 

Analytic techniques model the placement problem using an objective (cost) function, 
which can be maximized or minimized via mathematical analysis. The objective can 
be quadratic or otherwise non-convex. Examples of analytic techniques include 
quadratic placement and force-directed placement (Sec. 4.3.2). 



In stochastic algorithms, randomized moves are used to optimize the cost function. 
An example of this approach is simulated annealing (Sec. 4.3.3). 
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Fig. 4.6 Common techniques for global placement. 

► 4.3.1 Min-Cut Placement 

In the late 1970s, M. A. Breuer [4.1] studied min-cut placement, which uses 
partitioning algorithms to divide (1) the netlist and (2) the layout region into smaller 
sub-netlists and sub-regions, respectively. The sub-netlists and sub-regions are 
repeatedly divided into even smaller partitions until each sub-region contains only a 
small number of cells. Conceptually, each sub-region is assigned a portion of the 
original netlist. However, when implementing min-cut placement, the netlist should 
be divided such that each sub-region has access to its own unique (induced) 
sub-netlist. 



Each cut heuristically minimizes the number of cut nets (Sec. 4.2). Standard 
algorithms used to minimize the number of cut nets are the Kernighan-Lin (KL) 
algorithm (Sec. 2.4.1) and the Fiduccia-Mattheyses (FM) algorithm (Sec. 2.4.3). 

Min-Cut Algorithm 

Input: netlist Netlist, layout area LA, minimum number of cells per region cells_min 
Output: placement P 

P = Ø 
regions = ASSIGN(Netlist,LA)              // assign netlist to layout area 
while (regions != Ø)                      // while regions still not placed 
  region = FIRST_ELEMENT(regions)         // first element in regions 
  REMOVE(regions,region)                  // remove first element of regions 
  if (region contains more than cells_min cells) 
    (sr1,sr2) = BISECT(region)            // divide region into two subregions 
                                          // sr1 and sr2, obtaining the sub- 
                                          // netlists and sub-areas 
    ADD_TO_END(regions,sr1)               // add sr1 to the end of regions 
    ADD_TO_END(regions,sr2)               // add sr2 to the end of regions 
  else 
    PLACE(region)                         // place region 
    ADD(P,region)                         // add region to P 
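The control flow of the algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `naive_bisect` simply splits the cell list in half and stands in for a real min-cut partitioner such as KL or FM, cut directions alternate starting with a vertical cut, and cells are placed at the center of their final region. All names are hypothetical.

```python
from collections import deque

def naive_bisect(cells):
    """Placeholder for BISECT: splits the list in half (NOT a min-cut partitioner)."""
    half = len(cells) // 2
    return cells[:half], cells[half:]

def min_cut_place(cells, area, cells_min=1, bisect=naive_bisect):
    """Recursive-bisection placement; area = (x0, y0, x1, y1).

    Returns a dict mapping each cell to the center of its final region.
    """
    placement = {}
    regions = deque([(list(cells), area, True)])   # True: next cut is vertical
    while regions:
        sub, (x0, y0, x1, y1), vertical = regions.popleft()
        if len(sub) > cells_min:
            first, second = bisect(sub)
            if vertical:
                xm = (x0 + x1) / 2
                regions.append((first, (x0, y0, xm, y1), False))
                regions.append((second, (xm, y0, x1, y1), False))
            else:
                ym = (y0 + y1) / 2
                regions.append((first, (x0, y0, x1, ym), True))
                regions.append((second, (x0, ym, x1, y1), True))
        else:
            for cell in sub:                       # PLACE and ADD
                placement[cell] = ((x0 + x1) / 2, (y0 + y1) / 2)
    return placement

print(min_cut_place(["a", "b", "c", "d"], (0, 0, 4, 4)))
```

Passing a different `bisect` callable is where a cut-minimizing partitioner would be plugged in; the queue mirrors the FIRST_ELEMENT / ADD_TO_END operations of the pseudocode.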



Min-cut optimization is performed iteratively, one cutline at a time. That is, heuristic 
minimum cuts are found for the current sub-netlist, based on the current sub-regions. 
Ideally, the algorithm should directly optimize the placement figures of merit X(P), 
Y(P) and L(P) (Sec. 4.2). However, this is computationally infeasible, especially for 
large designs. Moreover, even if every cut were optimal, the final solution would not 
necessarily be optimal. Therefore, the algorithm iteratively minimizes cut size, 
where the cut size is minimized on the first (horizontal) cut cut1, then minimized on 
the second (vertical) cut cut2, and so on. 

$$\text{minimize}(\psi_P(cut_1)) \rightarrow \text{minimize}(\psi_P(cut_2)) \rightarrow \cdots \rightarrow \text{minimize}(\psi_P(cut_{|Cuts|}))$$ 

Let Cuts be the set of all cutlines made in the layout region. The sequence cut1, 
cut2, ... , cut|Cuts| denotes the order in which the cuts are made. Possible approaches 
to dividing the layout include using alternating and repeating cutline directions. 

With alternating cutline directions, the algorithm divides the layout by switching 
between sets of vertical and horizontal cutlines. In Fig. 4.7(a), the horizontal cutline 
cut1 is made first to bisect the region. Then, two vertical cutlines cut2a and cut2b are 
made, one (cut2a) in the top half and the other (cut2b) in the bottom half. Each newly 
formed region is divided in half again with four horizontal cutlines cut3a-cut3d, 
followed by eight vertical cutlines cut4a-cut4h. This approach is suitable for 
standard-cell designs with high wire density in the center of the layout region. 



With repeating cutline directions, the layout is divided using only vertical 
(horizontal) cutlines until each column (row) is the width (height) of a standard cell. 
Then, the layout is divided using the orthogonal set of cutlines, e.g., if horizontal 
cutlines are first used to generate rows, then vertical lines are used to divide each 
row into columns. In Fig. 4.7(b), the horizontal cut cut1 is made first to divide the 
region in half. Then, cut2a and cut2b divide the region into four rows. Next, four 
vertical cuts cut3a-cut3d are made, followed by eight vertical cuts cut4a-cut4h. This 
approach often results in greater wirelength because the aspect ratios of the 
sub-regions can be very far from one. For example, when bisecting the sub-netlist 
with cut3a, very little information is available regarding the x-locations of adjacent 
standard cells in the other sub-regions. 

Fig. 4.7 Partitioning a region using different cutline approaches. (a) Alternating cutline directions. 
(b) Repeating cutline directions. 
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Example: Min-Cut Placement Using the KL Algorithm 

Given: (1) circuit with gates a-f (left), (2) 2 x 4 layout (right), and (3) initial vertical cut cut1. 

Task: find a placement with minimum wirelength using alternating cutline directions and the 
KL algorithm. 

Solution: 

After vertical cut cut1, L = {a,b,c} and R = {d,e,f}. Partition using the KL algorithm. 

D(a) = 1    D(d) = 1 
D(b) = 1    D(e) = -1 
D(c) = 1    D(f) = -1 
D(∅) = 0    D(∅) = 0 

$\Delta g_1 = D(c) + D(d) - 2c(c,d) = 1 + 1 - 2(0) = 2$. Swap nodes c and d. 

After horizontal cut cut2L, T = {a,d}, B = {b}. After horizontal cut cut2R, T = {c,e}, B = {f}. 

For cut2L: 

D(a) = -1    D(d) = 
D(b) = 1     D(∅) = 0 

No swapping because no $\Delta g > 0$. 

For cut2R: 

D(c) = -1    D(e) = 
D(∅) = 0     D(f) = 1 

No swapping because no $\Delta g > 0$. 

Make four vertical cuts cut3TL, cut3TR, cut3BL and cut3BR. Each region has only one node, so 
terminate the algorithm. 

After external pin consideration (not shown), the final placement is obtained. 

[Figure: final placement of gates a-f in the 2 x 4 layout, with the cutlines marked.] 
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Example: Min-Cut Placement Using the FM Algorithm 

Given: (1) circuit with gates a-g (left), (2) gate areas, (3) ratio factor r = 0.5, (4) initial 
partitioning with vertical cut cut1, and (5) a 2 x 4 layout (right). 

area(INV) = 1, area(NAND) = 2, area(NOR) = 2 





Task: find a placement with minimum wirelength using alternating cutline directions and 
using the FM algorithm. 

Solution: 

Initial vertical cut cut1: L = {a,b,c}, R = {d,e,f,g}. 

Balance criterion: 0.5 · 11 - 2 ≤ area(A) ≤ 0.5 · 11 + 2, i.e., 3.5 ≤ area(A) ≤ 7.5. 

Iteration 1: 

Gates a, b, c and g have maximum gain $\Delta g_1 = 1$. 

Balance criterion for gates a, b and c is violated: area(A) < 3.5. 

Balance criterion for gate g is met: area(A) = 6. 

Move gate g. 

Iteration 2: 

Gate a has maximum gain $\Delta g_2(a) = 1$, area(A) = 4, balance criterion is met. 

Move gate a (further selection steps result in negative $\Delta g$ and are omitted). 

Maximum positive gain $G_2 = \Delta g_1 + \Delta g_2 = 2$. 

After cut cut1, L = {a,d,e,f}, R = {b,c,g}, cut cost = 1. 

After cut cut2L, T = {a,d}, B = {e,f}, cut cost = 1. 

After cut cut2R, T = {c}, B = {b,g}, cut cost = 1. 

Make three more cuts cut3TL, cut3BL and cut3BR such 
that every sub-region has one gate. 

After external pin consideration (not shown), the final placement is obtained. 

[Figure: final placement of gates a-g in the 2 x 4 layout, with the cutlines marked.] 
Naive min-cut algorithms do not consider the locations of connection pins within 
partitions that have already been visited. Likewise, the locations of fixed, external 
pin connections (pads) are also ignored. However, as illustrated in Fig. 4.8, cell a 
should properly be placed as close as possible to where the terminal p' is located. 

Fig. 4.8 Placement of cell a is close to the terminal p', 
which represents a connection to a neighboring partition. 



Formally developed by A. E. Dunlop and B. W. Kernighan [4.8], terminal 
propagation considers external pin locations during partitioning-based placement. 
During min-cut placement, external connections are represented by artificial 
connection points on the cutline and are dummy nodes in hypergraphs. The locations 
of the connection points affect placement cost functions and hence cell placements. 

Min-cut placement with external connections assumes that the cells are placed in the 
centers of their respective partitions. If the related connections (dummy nodes) are 
close to the next partition cutline, these nodes are not considered when making the 
next cut. Dunlop and Kernighan define close as projecting onto the middle third of 
the region boundary that is being cut [4.8]. 
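The "middle third" rule can be expressed as a one-line test. In the sketch below, a terminal is described by its coordinate along the region boundary that is being cut, and `lo` and `hi` are the ends of that boundary; the function name and this one-dimensional representation are assumptions made for illustration.

```python
def is_close_to_cutline(coord, lo, hi):
    """True if coord projects onto the middle third of the boundary [lo, hi].

    Terminals for which this returns True are ignored when making the next cut.
    """
    third = (hi - lo) / 3.0
    return lo + third <= coord <= hi - third

# Hypothetical region boundary from 0 to 9: the middle third is [3, 6].
print(is_close_to_cutline(5, 0, 9))  # inside the middle third: ignored
print(is_close_to_cutline(1, 0, 9))  # outside: kept as a dummy node
```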



Example: Min-Cut Placement With External Connections 
Given: gates a-d of a circuit (right). 
Task: place the gates in a 2 x 2 grid. 

Solution: 

Let gates a-d be represented by nodes a-d. Partition into L and R. Partition cost = 1. 

Partition L into TL and BL independent of R, since the connection x is close to the new 
horizontal cutline (assume that nodes are placed in the centers of their respective partitions). 

Partition R into TR and BR depending on L, since the connection x is not close to the cutline. 
The addition of the dummy node p' means that node d moves to grid TR and node c moves to 
grid BR so that the cost of the TR-BR partition is 1. 

Note that without p', the cost of placing c in TR and node d in BR is still 1, but the total cost 
with consideration of x becomes 2. 



If a net net crosses the partitioned region, all its pins, including those that lie outside 
the region, must be considered. A rectilinear Steiner minimum tree (RSMT) (Sec. 
4.2) is constructed that includes each pin outside the region. Each point at which this 
tree crosses the region's boundary induces a dummy node p' for net. If p' is close to 
a cutline, then it is ignored during partitioning. Otherwise, it is considered during 
partitioning along with cells that are contained in the region. 



Example: Min-Cut Placement Considering Pins Outside of the Partitioned Region 

Given: (1) layout region (right) and (2) net N1 
with three pins a-c not in the partitioned region. 

Task: account for pins a-c. 

Solution: 

For pins a-c, construct a rectilinear Steiner 
minimum tree (RSMT). Label all points of that 
tree that cross the layout region with dummy 
nodes $p_a'$-$p_c'$. 

If a vertical cut cutV is made, then $p_b'$ is ignored 
because it is close to the cutline, while $p_a'$ and 
$p_c'$ are considered as real nodes. If a horizontal 
cut cutH is made, then $p_c'$ is ignored, and $p_a'$ and 
$p_b'$ are considered as real nodes. 
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► 4.3.2 Analytic Placement 

Analytic placement minimizes a given objective, such as wirelength or circuit delay, 
using mathematical techniques such as numerical analysis or linear programming. 
Such methods often require certain assumptions, such as the differentiability of the 
objective function or the treatment of placeable objects as dimensionless points. For
example, to facilitate the calculation of partial derivatives, it is common to optimize 
quadratic, rather than linear wirelength. When such algorithms place cells too close, 
i.e., creating overlaps, the cell locations must be spread further apart by dedicated 
post-processing techniques, so as to remove overlap. 

Quadratic placement. The squared Euclidean distance

$$L(P) = \sum_{i=1,j=1}^{n} c(i,j)\left((x_i - x_j)^2 + (y_i - y_j)^2\right)$$

is used as the cost function, where n is the total number of cells and c(i,j) is the
connection cost between cells i and j. If cells i and j are not connected, then
c(i,j) = 0. The terms $(x_i - x_j)^2$ and $(y_i - y_j)^2$ respectively give the squared
horizontal and vertical distances between the centers of i and j. This formulation
implicitly decomposes all nets into two-pin subnets. The quadratic form emphasizes the
minimization of long connections, which tend to have negative impacts on timing.

Quadratic placement consists of two stages. During global placement (first stage), 
cells are placed so as to minimize the quadratic function with respect to the cell
centers. Note that this placement is not legal. Usually, cells appear in large clusters 
with many cell overlaps. During detailed placement (second stage), these large 
clusters are broken up and all cells are placed such that no overlap occurs. That is, 
detailed placement legalizes all the cell locations and produces a high-quality, 
non-overlapping placement. 

During global placement, each dimension can be considered independently.
Therefore, the cost function L(P) can be separated into x- and y-components

$$L_x(P) = \sum_{i=1,j=1}^{n} c(i,j)(x_i - x_j)^2 \quad\text{and}\quad L_y(P) = \sum_{i=1,j=1}^{n} c(i,j)(y_i - y_j)^2$$

With these cost functions, the placement problem becomes a convex quadratic
optimization problem. Convexity implies that any local minimum solution is also a
global minimum. Hence, the optimal x- and y-coordinates can be found by setting
the partial derivatives of $L_x(P)$ and $L_y(P)$ to zero, i.e.,

$$\frac{\partial L_x(P)}{\partial X} = AX - b_x = 0 \quad\text{and}\quad \frac{\partial L_y(P)}{\partial Y} = AY - b_y = 0$$

where A is a matrix with A[i][j] = -c(i,j) when i ≠ j, and A[i][i] = the sum of incident
connection weights of cell i. X is a vector of all the x-coordinates of the non-fixed
cells, and $b_x$ is a vector with $b_x[i]$ = the sum of x-coordinates of all fixed cells
attached to i, each multiplied by its connection weight. Y is a vector of all the
y-coordinates of the non-fixed cells, and $b_y$ is a vector with $b_y[i]$ = the
corresponding weighted sum of y-coordinates of all fixed cells attached to i.

This is a system of linear equations for which iterative numerical methods can be 
used to find a solution. Known methods include the conjugate gradient (CG) 
method [4.27] and the successive over-relaxation (SOR) method.

During detailed placement, cells are spread out to remove all overlaps. Known 
methods include those described for min-cut placement (Sec. 4.3.1) [4.33] and 
force-directed placement [4.9]. 



Example: Quadratic Placement
Given: (1) placement P with two fixed points p1 (100,175) and p2 (200,225), (2) three
free blocks a-c and (3) nets N1-N4.
N1 (p1,a)   N2 (a,b)   N3 (b,c)   N4 (c,p2)

Task: find the coordinates of blocks $(x_a,y_a)$, $(x_b,y_b)$ and $(x_c,y_c)$.



Solution: 

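Solve for x-coordinates.

$$L_x(P) = (100 - x_a)^2 + (x_a - x_b)^2 + (x_b - x_c)^2 + (x_c - 200)^2$$

$$\frac{\partial L_x(P)}{\partial x_a} = -2(100 - x_a) + 2(x_a - x_b) = 4x_a - 2x_b - 200 = 0$$

$$\frac{\partial L_x(P)}{\partial x_b} = -2(x_a - x_b) + 2(x_b - x_c) = -2x_a + 4x_b - 2x_c = 0$$

$$\frac{\partial L_x(P)}{\partial x_c} = -2(x_b - x_c) + 2(x_c - 200) = -2x_b + 4x_c - 400 = 0$$

Put in matrix form $AX = b_x$.

$$\begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & -2 \\ 0 & -2 & 4 \end{bmatrix} \begin{bmatrix} x_a \\ x_b \\ x_c \end{bmatrix} = \begin{bmatrix} 200 \\ 0 \\ 400 \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} \begin{bmatrix} x_a \\ x_b \\ x_c \end{bmatrix} = \begin{bmatrix} 100 \\ 0 \\ 200 \end{bmatrix}$$

Solve for X: $x_a = 125$, $x_b = 150$, $x_c = 175$.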

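Solve for y-coordinates.

$$L_y(P) = (175 - y_a)^2 + (y_a - y_b)^2 + (y_b - y_c)^2 + (y_c - 225)^2$$

$$\frac{\partial L_y(P)}{\partial y_a} = -2(175 - y_a) + 2(y_a - y_b) = 4y_a - 2y_b - 350 = 0$$

$$\frac{\partial L_y(P)}{\partial y_b} = -2(y_a - y_b) + 2(y_b - y_c) = -2y_a + 4y_b - 2y_c = 0$$

$$\frac{\partial L_y(P)}{\partial y_c} = -2(y_b - y_c) + 2(y_c - 225) = -2y_b + 4y_c - 450 = 0$$

Put in matrix form $AY = b_y$.

$$\begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & -2 \\ 0 & -2 & 4 \end{bmatrix} \begin{bmatrix} y_a \\ y_b \\ y_c \end{bmatrix} = \begin{bmatrix} 350 \\ 0 \\ 450 \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} \begin{bmatrix} y_a \\ y_b \\ y_c \end{bmatrix} = \begin{bmatrix} 175 \\ 0 \\ 225 \end{bmatrix}$$

Solve for Y: $y_a = 187.5$, $y_b = 200$, $y_c = 212.5$.

Final solution: a (125,187.5), b (150,200) and c (175,212.5).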
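
The worked example can be checked numerically. The following is a minimal sketch rather than a production solver: the helper names `build_system` and `conjugate_gradient` are illustrative assumptions, all nets are taken to be two-pin connections with unit weight, and the conjugate gradient loop is the standard textbook variant written in plain Python.

```python
def build_system(free, fixed, nets):
    """Assemble A and b for one coordinate axis.

    free  : list of movable cell names
    fixed : dict mapping fixed-pin name -> coordinate on this axis
    nets  : list of (u, v, weight) two-pin connections
    """
    idx = {name: k for k, name in enumerate(free)}
    n = len(free)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for u, v, w in nets:
        for s, t in ((u, v), (v, u)):
            if s not in idx:
                continue
            A[idx[s]][idx[s]] += w            # diagonal: sum of incident weights
            if t in idx:
                A[idx[s]][idx[t]] -= w        # off-diagonal: -c(i,j)
            else:
                b[idx[s]] += w * fixed[t]     # fixed neighbor contributes to b
    return A, b


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual b - A x for x = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


free = ["a", "b", "c"]
nets = [("p1", "a", 1.0), ("a", "b", 1.0), ("b", "c", 1.0), ("c", "p2", 1.0)]
A, bx = build_system(free, {"p1": 100.0, "p2": 200.0}, nets)
_, by = build_system(free, {"p1": 175.0, "p2": 225.0}, nets)
xs = conjugate_gradient(A, bx)
ys = conjugate_gradient(A, by)
```

With the data of the example, `A` and `bx` come out as the reduced system (entries 2 and -1, right-hand side 100, 0, 200), because each connection is entered with weight 1 instead of the factor 2 produced by differentiation.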






Force-directed placement. During force-directed placement, cells and wires are 
modeled using the mechanical analogy of a mass-spring system, i.e., masses 
connected to Hooke's-Law springs. Each cell exerts an attractive force on other cells,
where the attraction force is directly proportional to distance. If free movement is 
granted to all elements of the mass-spring system, then all cells will eventually settle 
in a configuration that achieves force-equilibrium. Relating this back to circuit 
placement, the wirelength will be minimized if all cells reach their equilibrium 
locations. Thus, the goal is to have all cells settle to a placement with force 
equilibrium. 

Force-directed placement, developed in the late 1970s by N. R. Quinn [4.24], is a 
special case of quadratic placement. The potential energy of a stretched 
Hooke's-Law spring between cells a and b is directly proportional to the squared
Euclidean distance between a and b. Furthermore, the spring force exerted on a
given cell i is the negative partial derivative of potential energy with respect to i's position.
Thus, determining the energy-minimal positions of cells is equivalent to minimizing 
the sum of squared Euclidean distances. 

Given two connected cells a and b, the attraction force $\vec{F}_{ab}$ exerted on a by b is

$$\vec{F}_{ab} = c(a,b) \cdot (\vec{b} - \vec{a})$$

where c(a,b) is the connection weight (priority) between cells a and b, and
$(\vec{b} - \vec{a})$ is the vector difference of the positions of a and b in the
Euclidean plane. The sum of forces exerted on a cell i connected to other cells j is
(Fig. 4.9)

$$\vec{F}_i = \sum_{c(i,j) \neq 0} \vec{F}_{ij}$$

The position that minimizes this sum of forces is known as the zero-force target
(ZFT).

$$\min \vec{F}_i = c(i,a) \cdot (\vec{a} - \vec{i}) + c(i,b) \cdot (\vec{b} - \vec{i}) + c(i,c) \cdot (\vec{c} - \vec{i}) + c(i,d) \cdot (\vec{d} - \vec{i})$$

Fig. 4.9 ZFT position of cell i, which is connected to four other cells a-d.



Using this formulation, two possible extensions are as follows. First, in addition to
the attraction force, a repulsion force can be added for unconnected cells to avoid
overlap and take into account the overall layout (for a balanced placement). These
forces form a system of linear equations that can be solved efficiently using analytic
techniques, as with quadratic placement. Second, for each cell, there is an ideal
minimum-energy (ZFT) position. By iteratively moving each cell to this position (or
nearby, if that position is already occupied) or by cell swapping, gradual
improvements can be made. In the end, all cells will be in a minimum-force
equilibrium configuration.

Basic force-directed placement algorithms. Force-directed placement algorithms
iteratively move all cells to their respective ZFT positions. In order to find a cell i's
ZFT position $(x_i^0, y_i^0)$, all forces that affect i must be taken into account. Since
the ZFT position minimizes the force, both the x- and y-direction forces are set to zero.

$$\sum_{c(i,j) \neq 0} c(i,j) \cdot (x_j^0 - x_i^0) = 0 \quad\text{and}\quad \sum_{c(i,j) \neq 0} c(i,j) \cdot (y_j^0 - y_i^0) = 0$$

Rearranging the variables to solve for $x_i^0$ and $y_i^0$ yields

$$x_i^0 = \frac{\sum_{c(i,j) \neq 0} c(i,j) \cdot x_j^0}{\sum_{c(i,j) \neq 0} c(i,j)} \quad\text{and}\quad y_i^0 = \frac{\sum_{c(i,j) \neq 0} c(i,j) \cdot y_j^0}{\sum_{c(i,j) \neq 0} c(i,j)}$$

Using these equations, the ZFT position of a cell i that is connected to other cells j
can be computed.



Example: ZFT Position

Given: (1) circuit with NAND gate (cell) a (left), (2) four I/O pads - In1-In3 and Out -
and their positions, (3) a 3 x 3 layout grid (right), and (4) weighted connections.

In1 (2,2)   In2 (0,2)   In3 (0,0)   Out (2,0)
c(a,In1) = 8   c(a,In2) = 10
c(a,In3) = 2   c(a,Out) = 2

Task: find the ZFT position of cell a.

Solution:

$$x_a^0 = \frac{\sum_{c(a,j) \neq 0} c(a,j) \cdot x_j^0}{\sum_{c(a,j) \neq 0} c(a,j)} = \frac{c(a,In1) \cdot x_{In1} + c(a,In2) \cdot x_{In2} + c(a,In3) \cdot x_{In3} + c(a,Out) \cdot x_{Out}}{c(a,In1) + c(a,In2) + c(a,In3) + c(a,Out)} = \frac{8 \cdot 2 + 10 \cdot 0 + 2 \cdot 0 + 2 \cdot 2}{8 + 10 + 2 + 2} = \frac{20}{22} \approx 0.9$$

$$y_a^0 = \frac{\sum_{c(a,j) \neq 0} c(a,j) \cdot y_j^0}{\sum_{c(a,j) \neq 0} c(a,j)} = \frac{c(a,In1) \cdot y_{In1} + c(a,In2) \cdot y_{In2} + c(a,In3) \cdot y_{In3} + c(a,Out) \cdot y_{Out}}{c(a,In1) + c(a,In2) + c(a,In3) + c(a,Out)} = \frac{8 \cdot 2 + 10 \cdot 2 + 2 \cdot 0 + 2 \cdot 0}{8 + 10 + 2 + 2} = \frac{36}{22} \approx 1.6$$

ZFT position of cell a is at (1,2).
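
The ZFT computation above is a weighted average of neighbor positions, which the following short sketch reproduces. The function name `zft_position` and the rounding to the nearest grid point are illustrative assumptions; the pad positions and weights are those of the example.

```python
def zft_position(connections):
    """Weighted average of neighbor positions.

    connections: list of (weight, (x, y)), one entry per connected neighbor.
    """
    total = sum(w for w, _ in connections)
    x = sum(w * px for w, (px, _) in connections) / total
    y = sum(w * py for w, (_, py) in connections) / total
    return x, y


# In1 (2,2), In2 (0,2), In3 (0,0), Out (2,0) with weights 8, 10, 2, 2
conns = [(8, (2, 2)), (10, (0, 2)), (2, (0, 0)), (2, (2, 0))]
x0, y0 = zft_position(conns)
grid_pos = (round(x0), round(y0))   # snap to the nearest grid location
```

Here `x0` is 20/22 and `y0` is 36/22, so snapping to the 3 x 3 grid gives the location (1,2) stated in the example.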






The following gives a high-level sketch of the force-directed placement approach. 



Force-Directed Placement Algorithm
Input: set of all cells V
Output: placement P

1.  P = PLACE(V)                          // arbitrary initial placement
2.  loc = LOCATIONS(P)                    // set coordinates for each cell in P
3.  foreach (cell c ∈ V)
4.    status[c] = UNMOVED
5.  while (!ALL_MOVED(V) && !STOP())      // continue until all cells have been
                                          // moved or some stopping
                                          // criterion is reached
6.    c = MAX_DEGREE(V,status)            // unmoved cell that has largest
                                          // number of connections
7.    ZFT_pos = ZFT_POSITION(c)           // ZFT position of c
8.    if (loc[ZFT_pos] == Ø)              // if position is unoccupied,
9.      loc[ZFT_pos] = c                  // move c to its ZFT position
10.   else
11.     RELOCATE(c,loc)                   // use methods discussed below
12.   status[c] = MOVED                   // mark c as moved



Starting with an initial placement (line 1), the algorithm selects a cell c that has the 
largest number of connections and has not been moved (line 6). It then calculates the 
ZFT position of c (line 7). If that position is unoccupied, then c is moved there (lines 
8-9). Otherwise, c can be moved according to several methods discussed next (lines 
10-11). This process continues until some stopping criterion is reached or until all 
cells have been considered (lines 5-12).

Finding a valid location for a cell with an occupied ZFT position. Let p be an
incoming cell and let q be the cell that is currently in p's ZFT position. Following
are four options for how to rearrange p and q.

— If possible, move p to a cell position close to q.

— Compute the cost difference if p and q were to be swapped. If the total cost
reduces, i.e., the weighted connection length L(P) is smaller, then swap p and q.

— Chain move: cell p is moved to cell q's location. Cell q, in turn, is shifted to
the next position. If a cell r is occupying this space, cell r is shifted to the next
position. This continues until all affected cells are placed.

— Ripple move: cell p is moved to cell q's location. A new ZFT position for cell q
is computed (discussed below). This ripple effect continues until all cells are
placed.



Example: Force-Directed Placement

Given: (1) placement of blocks b1-b3 (right) and (2) weighted nets N1 and N2.

N1 = (b1,b3)   N2 = (b2,b3)   c(N1) = 2   c(N2) = 1

[Figure: initial placement with b1, b2, b3 at positions 0, 1, 2]

Task: use force-directed placement to find a heuristic minimum-wirelength placement.

Solution:

Consider cell b3.

$$x_{b3}^0 = \frac{\sum_{c(b3,j) \neq 0} c(b3,j) \cdot x_j^0}{\sum_{c(b3,j) \neq 0} c(b3,j)} = \frac{2 \cdot 0 + 1 \cdot 1}{2 + 1} \approx 0$$

The ZFT position for b3 is 0, which is occupied by cell b1.

L(P) before move = 3, L(P) after move = 3. Cells b3 and b1 should not swap.

Consider cell b2.

$$x_{b2}^0 = \frac{\sum_{c(b2,j) \neq 0} c(b2,j) \cdot x_j^0}{\sum_{c(b2,j) \neq 0} c(b2,j)} = \frac{1 \cdot 2}{1} = 2$$

The ZFT position for b2 is 2, which is occupied by cell b3.

L(P) before move = 3, L(P) after move = 2. Cells b2 and b3 should swap.

[Figure: final placement with b1, b3, b2 at positions 0, 1, 2]
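
The two swap decisions in this example can be replayed with a few lines of code. This is a sketch under stated assumptions: cells occupy integer slots in a single row, L(P) is the unweighted sum of two-pin net lengths (which matches the values 3, 3 and 2 above), and the helper names `wirelength`, `zft_1d` and `try_swap` are hypothetical.

```python
def wirelength(pos, nets):
    """Unweighted total length of two-pin nets in a single row."""
    return sum(abs(pos[u] - pos[v]) for u, v, _ in nets)


def zft_1d(cell, pos, nets):
    """Weighted average position of the neighbors of cell, rounded to a slot."""
    num = den = 0.0
    for u, v, w in nets:
        if cell in (u, v):
            other = v if u == cell else u
            num += w * pos[other]
            den += w
    return round(num / den)


def try_swap(cell, pos, nets):
    """Move cell to its ZFT slot, swapping with any occupant, if L(P) shrinks."""
    target = zft_1d(cell, pos, nets)
    occupant = next((c for c, p in pos.items() if p == target and c != cell), None)
    new_pos = dict(pos)
    if occupant is None:
        new_pos[cell] = target
    else:
        new_pos[cell], new_pos[occupant] = pos[occupant], pos[cell]
    return new_pos if wirelength(new_pos, nets) < wirelength(pos, nets) else pos


nets = [("b1", "b3", 2), ("b2", "b3", 1)]
start = {"b1": 0, "b2": 1, "b3": 2}
after_b3 = try_swap("b3", start, nets)     # cost stays 3, so no swap
final = try_swap("b2", after_b3, nets)     # cost drops from 3 to 2, so swap
```

The first call leaves the placement unchanged and the second exchanges b2 and b3, giving the order b1, b3, b2.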



Force-directed placement with ripple moves. In force-directed placement with
ripple moves [4.30], the cells are sorted in descending order according to their
connection degrees (lines 1-3). During each iteration (line 4), the ZFT position for
the next cell in the list (seed) is computed (line 9). If the position is free, seed is
moved there (lines 12-17). If the position is occupied, then its current inhabitant is
moved next, assuming it has not already been moved. In order to avoid infinite loops,
once a cell has been moved, it is marked as LOCKED and can no longer be moved
until the next iteration (lines 18-40).

Force-Directed Placement with Ripple Moves Algorithm
Input: set of all cells V
Output: positions of each cell pos

1.  foreach (cell c ∈ V)
2.    degree[c] = CONNECTION_DEGREE(c)           // compute connection degree
3.  L = SORT(V,degree)                           // descending order by degree
4.  while (iteration_count < iteration_limit)
5.    end_ripple_move = false
6.    seed = NEXT_ELEMENT(L)                     // next cell in L
7.    pos_type[seed] = VACANT                    // position type = VACANT
8.    while (end_ripple_move == false)
9.      curr_pos = ZFT_POSITION(seed)
10.     curr_pos_type = ZFT_POSITION_TYPE(seed)
11.     switch (curr_pos_type)
12.       case VACANT:
13.         pos[seed] = MOVE(seed,curr_pos)      // move seed to curr_pos
14.         pos_type[seed] = LOCKED
15.         end_ripple_move = true
16.         abort_count = 0
17.         break
18.       case LOCKED:
19.         pos[seed] = MOVE(seed,NEXT_FREE_POS())
20.         pos_type[seed] = LOCKED
21.         end_ripple_move = true
22.         abort_count = abort_count + 1
23.         if (abort_count > abort_limit)
24.           foreach (pos_type[c] == LOCKED)
25.             RESET(pos_type[c])
26.           iteration_count = iteration_count + 1
27.         break
28.       case SAME_AS_PRESENT_LOCATION:
29.         pos[seed] = MOVE(seed,curr_pos)
30.         pos_type[seed] = LOCKED
31.         end_ripple_move = true
32.         abort_count = 0
33.         break
34.       case OCCUPIED:                         // occupied but not locked
35.         prev_cell = CELL(curr_pos)           // cell in ZFT position
36.         pos[seed] = MOVE(seed,curr_pos)
37.         pos_type[seed] = LOCKED
38.         seed = prev_cell
39.         end_ripple_move = false
40.         abort_count = 0
41.         break

In each iteration, an upper bound of fixed cells (abort_limit) is imposed. If the
number of fixed cells exceeds this limit, then all fixed cells are released and the cell
with the next highest connectivity is considered in the new iteration (lines 23-26).
Each cell has one of four options (lines 11-41).

— VACANT: ZFT position is free. Place the cell there and proceed with the next
iteration (lines 12-17).

— LOCKED: ZFT position is occupied and fixed. Place the cell at the next free
position and increment abort_count. If abort_count > abort_limit, stop the
ripple moves and go on with the next cell in list L (lines 18-27).

— SAME_AS_PRESENT_LOCATION: ZFT position is the same as the current
position of the cell. Place cell here and consider the next cell in L (lines 28-33).

— OCCUPIED: ZFT position is occupied but not by a locked cell. Place the cell
here and move the cell that had occupied the ZFT position (lines 34-41).

The inner while loop (cell movement) is called only when end_ripple_move is false.
Note that this flag is set to true when the seed cell's ZFT position is VACANT,
LOCKED or SAME_AS_PRESENT_LOCATION. In these cases, the algorithm can
move on with the next cell in the list. The outer while loop goes through list L and
finds the best ZFT position of each cell until iteration_limit is reached.

4.3.3 Simulated Annealing

Simulated annealing (SA) (Sec. 3.5.3) is the basis of some of the most well-known 
placement algorithms. 

Simulated Annealing Algorithm for Placement
Input: set of all cells V
Output: placement P

1.  T = T0                               // set initial temperature
2.  P = PLACE(V)                         // arbitrary initial placement
3.  while (T > Tmin)
4.    while (!STOP())                    // not yet in equilibrium at T
5.      new_P = PERTURB(P)
6.      Δcost = COST(new_P) - COST(P)
7.      if (Δcost < 0)                   // cost improvement
8.        P = new_P                      // accept new placement
9.      else                             // no cost improvement
10.       r = RANDOM(0,1)                // random number [0,1)
11.       if (r < e^(-Δcost/T))          // probabilistically accept
12.         P = new_P
13.   T = α · T                          // reduce T, 0 < α < 1



From an initial placement (line 2), the PERTURB function generates a new
placement new_P by perturbing the current placement (line 5). Δcost records the
cost difference between the previous placement P and new_P (line 6). If the cost
improves, i.e., Δcost < 0, then new_P is accepted (lines 7-8). Otherwise, new_P is
probabilistically accepted (lines 9-12). Once the annealing process has "reached
equilibrium" (e.g., after a prescribed number of move attempts) at the current
temperature, the temperature is decreased. This process continues until T < Tmin.
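
The loop structure described above can be sketched on a toy problem. The setup below is an assumption made for illustration only: cells sit in the slots of a single row, the cost is the total length of two-pin nets, PERTURB swaps two random cells, and equilibrium is approximated by a fixed number of moves per temperature. Unlike the pseudocode, the sketch also remembers the best placement seen so far.

```python
import math
import random


def row_cost(order, nets):
    """Total length of two-pin nets when cells occupy consecutive slots."""
    slot = {c: k for k, c in enumerate(order)}
    return sum(abs(slot[u] - slot[v]) for u, v in nets)


def anneal(cells, nets, t0=10.0, t_min=0.01, alpha=0.9, moves_per_temp=50, seed=0):
    rng = random.Random(seed)
    order = list(cells)                          # arbitrary initial placement
    cur_cost = row_cost(order, nets)
    best, best_cost = order[:], cur_cost
    t = t0
    while t > t_min:
        for _ in range(moves_per_temp):          # stand-in for the STOP() test
            i, j = rng.sample(range(len(order)), 2)
            cand = order[:]
            cand[i], cand[j] = cand[j], cand[i]  # PERTURB: swap two cells
            delta = row_cost(cand, nets) - cur_cost
            # accept improvements always, degradations with probability e^(-delta/T)
            if delta < 0 or rng.random() < math.exp(-delta / t):
                order, cur_cost = cand, cur_cost + delta
                if cur_cost < best_cost:
                    best, best_cost = order[:], cur_cost
        t *= alpha                               # reduce temperature
    return best, best_cost


cells = ["a", "d", "b", "f", "c", "e"]
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
best, best_cost = anneal(cells, nets)
```

For this chain of nets the initial order has cost 13 and the optimum is 5; the returned cost can never exceed the initial one because the best placement is tracked separately from the current one.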

TimberWolf. One of the earliest academic packages, TimberWolf, was developed at 
the University of California, Berkeley by C. Sechen and later commercialized [4.28] 
[4.29]. From the widths of all cells in the netlist, the algorithm produces an initial 
placement with all cells in rows, along with a target row length. TimberWolf allots 
additional area around any macro cells to allow sufficient interconnect area, and uses 
a specialized pin assignment algorithm to optimize any macro-cell pins that are not 
fixed to specific locations. The macro cells are then placed simultaneously along 
with the standard cells. The original TimberWolf (v. 3.2) only optimizes the
placement of standard cells while I/Os and macros stay in their original locations. 

The placement algorithm consists of the following three stages. 

1. Place standard cells with minimum total wirelength using simulated annealing.

2. Globally route (Chap. 5) the placement by introducing routing channels, when
necessary. Recompute and minimize the total wirelength.

3. Locally optimize the placement with the goal of minimizing channel height.

The remainder of this section discusses the first (placement) stage. For further
reading on TimberWolf, see [4.29]. 

Perturb. The PERTURB function generates a new placement from an existing
placement using one of the following actions.

— MOVE: Shift a cell to a new position (another row)

— SWAP: Exchange two cells

— MIRROR: Reflect the cell orientation around the y-axis, used only if MOVE
and SWAP are infeasible

The scope of PERTURB is limited to a small window of size $w_T \times h_T$ (Fig. 4.10).
For MOVE, a cell can only be moved within this window.

Fig. 4.10 Window with the dimensions $w_T \times h_T$ around a standard cell c.



For SWAP, two cells a $(x_a,y_a)$ and b $(x_b,y_b)$ can be exchanged only if

$$|x_a - x_b| \le w_T \quad\text{and}\quad |y_a - y_b| \le h_T$$

The window size $(w_T,h_T)$ depends on the current temperature T, and decreases as
temperature reduces. The window size for the next iteration is based on the current
temperature $T_{curr}$ and the next iteration's temperature $T_{next}$ using

$$w_{T,next} = w_{T,curr} \cdot \frac{\log(T_{next})}{\log(T_{curr})} \quad\text{and}\quad h_{T,next} = h_{T,curr} \cdot \frac{\log(T_{next})}{\log(T_{curr})}$$
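
The window update and the SWAP test translate directly into code. The following sketch assumes temperatures greater than 1, so that both logarithms are positive; the function names `next_window` and `swap_allowed` are hypothetical.

```python
import math


def next_window(w_curr, h_curr, t_curr, t_next):
    """Scale the window by log(T_next) / log(T_curr); assumes both T > 1."""
    scale = math.log(t_next) / math.log(t_curr)
    return w_curr * scale, h_curr * scale


def swap_allowed(a, b, w_t, h_t):
    """Cells a and b, given as (x, y), may be swapped only inside the window."""
    return abs(a[0] - b[0]) <= w_t and abs(a[1] - b[1]) <= h_t


w_next, h_next = next_window(100.0, 50.0, t_curr=1000.0, t_next=100.0)
```

Cooling from T = 1000 to T = 100 shrinks the window by the factor log(100)/log(1000) = 2/3, regardless of the logarithm base.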

Cost. The COST function in TimberWolf (v. 3.2) is defined as
$\Gamma = \Gamma_1 + \Gamma_2 + \Gamma_3$, the sum of three parameters - (1) total
estimated wirelength $\Gamma_1$, (2) amount of overlap $\Gamma_2$, and (3) row
inequality length $\Gamma_3$.

$\Gamma_1$ is computed as the summation of each net's half-perimeter wirelength (HPWL),
which is defined as its horizontal span plus its vertical span. Weights for each
direction, horizontal weight $w_H$ and vertical weight $w_V$, can also be applied.
Given a priority weight $\gamma_1$, $\Gamma_1$ is defined as the sum of the total
wirelength over all nets net ∈ Netlist, where Netlist is the set of all nets, and
$x_{net}$ and $y_{net}$ are the horizontal and vertical spans of net.

$$\Gamma_1 = \gamma_1 \cdot \sum_{net \in Netlist} \left( w_H(net) \cdot x_{net} + w_V(net) \cdot y_{net} \right)$$

A higher weight value for net gives higher emphasis on reducing net's wirelength.
Weights can also be used for direction control - giving preference to a certain wiring
direction. During standard-cell placement where feedthrough cells are limited, low
horizontal weights $w_H(net)$ encourage the usage of horizontal channels rather than
the vertical connections.

$\Gamma_2$ represents the total cell overlap of the placement. Let o(i,j) represent the
area of overlap between cells i and j. Given a priority weight $\gamma_2$, $\Gamma_2$ is
defined as the sum of the square of all cell overlaps between cells i and j, where
i ∈ V, j ∈ V, i ≠ j, with V being the set of all cells.

$$\Gamma_2 = \gamma_2 \cdot \sum_{i \in V,\, j \in V,\, i \neq j} o(i,j)^2$$

Larger overlaps, which require more effort to correct, are penalized more heavily
due to the quadratic form.

$\Gamma_3$ represents the cost of all row lengths L(row) that deviate from the goal
length $L_{opt}(row)$ during placement. Cell movement can often lead to row length
variation, where the resulting row lengths deviate from the goal length. In practice,
uneven rows can waste area and induce uneven wire distributions. Both phenomena can
lead to increased total wirelength and total congestion. Given a priority factor
$\gamma_3$, $\Gamma_3$ is defined as the sum of row length deviation for all rows
row ∈ Rows, where Rows is the set of all rows.

$$\Gamma_3 = \gamma_3 \cdot \sum_{row \in Rows} \left| L(row) - L_{opt}(row) \right|$$
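
The three cost terms can be combined in a small function. This is an illustrative sketch, not TimberWolf's code: the data layout (pin lists, axis-aligned rectangles, row length pairs) and the names `hpwl_spans`, `overlap_area` and `timberwolf_cost` are assumptions. The overlap sum runs over ordered pairs i ≠ j, as in the formula, so each overlapping pair is counted twice.

```python
def hpwl_spans(pins):
    """Horizontal and vertical span of a net given its pin coordinates."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return max(xs) - min(xs), max(ys) - min(ys)


def overlap_area(a, b):
    """Overlap of two rectangles given as (x_lo, y_lo, x_hi, y_hi)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def timberwolf_cost(nets, rects, rows, g1=1.0, g2=1.0, g3=1.0):
    """nets: list of (pins, w_h, w_v); rows: list of (length, goal_length)."""
    gamma1 = g1 * sum(w_h * sx + w_v * sy
                      for pins, w_h, w_v in nets
                      for sx, sy in [hpwl_spans(pins)])
    gamma2 = g2 * sum(overlap_area(rects[i], rects[j]) ** 2
                      for i in range(len(rects))
                      for j in range(len(rects)) if i != j)
    gamma3 = g3 * sum(abs(length - goal) for length, goal in rows)
    return gamma1 + gamma2 + gamma3


nets = [([(0, 0), (3, 4)], 1.0, 1.0)]        # one net with spans 3 and 4
rects = [(0, 0, 2, 2), (1, 1, 3, 3)]         # two cells overlapping by area 1
rows = [(10, 8), (6, 8)]                     # two rows, each off goal by 2
total = timberwolf_cost(nets, rects, rows)
```

In this small instance the wirelength term is 7, the overlap term is 2 and the row term is 4.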



Temperature Reduction. The temperature T is reduced by a cooling factor α. This
value is empirically chosen and often depends on the temperature range. The
annealing process starts at a high temperature, such as $4 \cdot 10^6$ (units do not
play a role). Initially, the temperature is reduced quickly (α ≈ 0.8). After a certain
number of iterations, the temperature reduces at a slower rate (α ≈ 0.95), when the
placement is being fine-tuned. Toward the end, the temperature is again reduced at a
fast pace (α ≈ 0.8), corresponding to a "quenching" step. TimberWolf finishes when
T < Tmin, where Tmin = 1.
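
A three-phase schedule of this kind might be sketched as a generator. The text does not say where the phases begin and end, so the boundaries `hot_limit` and `cold_limit` below are purely hypothetical values chosen for illustration, as is the function name `cooling_schedule`.

```python
def cooling_schedule(t0=4e6, t_min=1.0, hot_limit=1e5, cold_limit=1e2):
    """Yield temperatures from t0 down to t_min.

    hot_limit and cold_limit are assumed phase boundaries, not values
    taken from TimberWolf.
    """
    t = t0
    while t >= t_min:
        yield t
        if t > hot_limit or t < cold_limit:
            alpha = 0.8       # fast cooling at the start, quenching at the end
        else:
            alpha = 0.95      # slow cooling while the placement is fine-tuned
        t *= alpha


temps = list(cooling_schedule())
```

Each yielded value corresponds to one pass through the inner annealing loop, so the length of `temps` is the number of temperature steps.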

Number of Times Through the Inner Loop. At each temperature, a number of calls 
are made to PERTURB to generate new placements. This number is intended to 
achieve equilibrium at the given temperature, and depends on the size of the design. 
The authors of [4.29] experimentally determined that designs with ~200 cells require
100 iterations per cell, or roughly $2 \cdot 10^4$ runs per temperature step. Other simulated
annealing approaches use acceptance ratio as an equilibrium criterion, e.g., Lam
[4.20] shows that a target acceptance ratio of 44% produces competitive results. 

4.3.4 Modern Placement Algorithms

Algorithms for global placement have been studied by many researchers since the 
late 1980s, and the prevailing paradigm has changed several times to address new 
challenges arising in commercial chip designs [4.6] [4.22]. This section reviews 
modern algorithms for global placement, while the next section covers legalization
and detailed placement, as well as the need for such a separation of concerns. 
Timing-driven placement is discussed in Sec. 8.3. 

The global placement algorithms in use today can handle extremely large netlists 
using analytic techniques, i.e., by modeling interconnect length with mathematical 
functions and optimizing these functions with numerical methods. Dimensions and 
sizes of standard cells are initially ignored to quickly find a seed placement, but are 
then gradually factored into the placement optimization so as to avoid uneven 
densities or routing congestion. Two common paradigms are based on quadratic and 
force-directed placement, and on nonlinear optimization. The former was introduced 
earlier and seeks to approximate wirelength by quadratic functions, which can be 
minimized by solving linear systems of equations (Sec. 4.3.2). The latter relies on 
more sophisticated functions to approximate interconnect length, and requires more 
sophisticated numerical optimization algorithms [4.4][4.16][4.17]. 

Of the two types, quadratic methods are easier to implement and appear to be more 
scalable in terms of runtime. Nonlinear methods require careful tuning to achieve 
numerical stability and often run much slower than quadratic methods. However, 
nonlinear methods can better account for the shapes and sizes of standard cells and
especially macro blocks, whereas quadratic placement requires dedicated spreading
techniques. Both placement techniques are often combined with netlist clustering to 
reduce runtime [4.4][4.16][4.17][4.34], in a manner that is conceptually similar to 
multilevel partitioning (Chap. 2). However, the use of clustering in placement often 
leads to losses in solution quality. Thus, it is an open question whether the multilevel 
approach can outperform the flat approach in terms of runtime and solution quality 
for placement, as is the case for partitioning [4.5]. 

The aspects of quadratic placement that appear most impactful in practice are (1) the
representation of multi-pin nets by sets of graph edges (net modeling), (2) the choice
of algorithms for spreading, and (3) the strategy for interleaving spreading with
quadratic optimization. Two common net models include cliques, where every pair
of pins is connected by an edge with a small weight, and stars, where every pin is
connected to a "star-point" that represents the net (or hyperedge) itself [4.12]. Edges
representing a net are given fractional weights that add up to the net's weight (or to
unity). For nets with fewer pins, cliques are preferred because they do not introduce
new variables. For larger nets, stars are useful because they entail only a linear
number of graph edges [4.34]. The star-point can be movable or placed in the
centroid (average location or barycenter) of its neighbors. The latter option is
preferred in practice because (1) it corresponds to the optimal location of the
star-point in quadratic placement, and (2) it saves two (x,y) variables. Some placers
additionally use a linearization technique that assigns a constant weight

$$w(i,j) = \frac{1}{|x_i - x_j|}$$

to each quadratic term $(x_i - x_j)^2$ within the objective function. The weight w(i,j) has
the effect of turning each such squared wirelength term into a linear wirelength term,
and can therefore be truer to an underlying linear wirelength objective. These
weights are treated as constants, and then updated between rounds of quadratic
optimization. A more accurate placement-dependent net model is proposed in [4.32].
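
The clique and star models can be illustrated with two small edge generators. The sketch assumes that a net's weight is split uniformly over its edges, which is only one way of making the weights add up to the net's weight; the names `clique_edges`, `star_edges` and `linearization_weight`, and the guard `eps` against coincident cells, are hypothetical.

```python
from itertools import combinations


def clique_edges(pins, net_weight=1.0):
    """One edge per pin pair (needs at least two pins); weights sum to net_weight."""
    pairs = list(combinations(pins, 2))
    w = net_weight / len(pairs)
    return [(u, v, w) for u, v in pairs]


def star_edges(pins, star_point, net_weight=1.0):
    """One edge from each pin to the star-point; weights sum to net_weight."""
    w = net_weight / len(pins)
    return [(p, star_point, w) for p in pins]


def linearization_weight(xi, xj, eps=1e-6):
    """Constant weight 1/|xi - xj|; eps avoids division by zero (an assumption)."""
    return 1.0 / max(abs(xi - xj), eps)


pins = ["a", "b", "c", "d"]
clique = clique_edges(pins)            # 6 edges for a 4-pin net
star = star_edges(pins, "s")           # 4 edges for a 4-pin net
linear_term = linearization_weight(0.0, 4.0) * (0.0 - 4.0) ** 2
```

For a k-pin net the clique has k(k-1)/2 edges while the star has k, and the weighted quadratic term in the last line evaluates to the linear distance 4.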

Spreading is based on estimates of cell density in different regions of the chip. These 
estimates are computed by allocating movable objects into bins of a regular grid, and 
comparing their total area to available capacity per bin. Spreading can be performed 
after quadratic optimization using a combination of sorting by location and 
geometric scaling [4.32]. For example, cells in a dense region may be sorted by their 
x-coordinates and then re-placed in this order, so as to avoid overlaps. An implicit 
spreading method to reduce overlap is to enclose a set of cells in a rectangular region 
and then perform linear scaling [4.18].
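
Density estimation on a regular grid and the sort-based re-placement mentioned above can be sketched as follows. The simplifications are assumptions: each cell is charged entirely to the bin that contains its center, all cells in the dense region have the same width, and the function names `bin_utilization` and `spread_by_sorting` are hypothetical.

```python
def bin_utilization(cells, chip_w, chip_h, nx, ny):
    """cells: list of (x_center, y_center, area).

    Returns an ny x nx grid of total cell area divided by bin capacity.
    """
    bw, bh = chip_w / nx, chip_h / ny
    usage = [[0.0] * nx for _ in range(ny)]
    for x, y, area in cells:
        col = min(int(x / bw), nx - 1)     # clamp cells on the right/top edge
        row = min(int(y / bh), ny - 1)
        usage[row][col] += area
    capacity = bw * bh
    return [[u / capacity for u in row] for row in usage]


def spread_by_sorting(xs, width, x_lo):
    """Re-place equal-width cells side by side, preserving their x-order."""
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    out = [0.0] * len(xs)
    for rank, k in enumerate(order):
        out[k] = x_lo + rank * width
    return out


util = bin_utilization([(1, 1, 2), (1.5, 0.5, 2), (9, 9, 1)], 10, 10, 2, 2)
spread = spread_by_sorting([3.0, 1.0, 2.0], width=2.0, x_lo=0.0)
```

A bin whose utilization exceeds 1 holds more cell area than it has capacity for and is a candidate for spreading.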

Spreading can also be integrated directly into quadratic optimization by adding 
spreading forces that push movable objects away from dense regions. These 
additional forces are modeled by imaginary fixed pins (anchor points) and imaginary 
wires pulling individual standard cells toward fixed pins [4.11]. This integration 
allows conventional quadratic placement to trade interconnect minimization for
smaller overlaps between modules. FastPlace [4.34] first performs simple geometric
scaling and then uses the resulting locations as anchor points during quadratic 
optimization. These steps of spreading and quadratic optimization are interleaved in 
FastPlace to encourage spreading that does not conflict with interconnect 
optimization. Researchers have also sought to develop spreading algorithms that are 
sufficiently accurate to be invoked only once after quadratic optimization [4.35]. 

Analytic placement can be extended to optimize not only interconnect length, but 
also routing congestion [4.31]. This requires wiring congestion estimation, which is 
similar to density estimation and is also maintained on a regular grid. Congestion 
information can be used in the same ways as density estimation to perform 
spreading. Some researchers have developed post-processors to improve congestion 
properties of a given placement [4.21]. 

Several modern placers are available free of charge for research purposes. As of
2010, the most accessible placers are APlace [4.16][4.17], Capo [4.2][4.26],
FastPlace 3.0 [4.34], mPL6 [4.4], and simPL [4.18]. All except simPL¹ are equipped
with legalizers and detailed placers so as to produce legal and highly optimized
solutions. mPL6 is significantly slower than FastPlace, but finds solutions with
smaller total interconnect length. Capo, a min-cut placer, is available in C++ source
code. Its runtime is between that of mPL6 and FastPlace, but in many cases it
produces solutions that are inferior to FastPlace solutions in terms of total
interconnect length. However, for designs where achieving routability is difficult,
Capo offers a better chance to produce a routable placement. It is also competitive
on smaller designs (below 50,000 movable objects), especially those with high 
density and many fixed obstacles. 



4.4 Legalization and Detailed Placement



Global placement assigns locations to standard cells and larger circuit modules, e.g., 
macro blocks. However, these locations typically do not align with power rails, and 
may have continuous (real) coordinates rather than discrete coordinates. Therefore, 
the global placement must be legalized. The allowed legal locations are equally 
spaced within pre-defined rows, and the point-locations from global placement 
should be snapped to the closest possible legal locations (Fig. 4.11).

Legalization is necessary not only after global placement, but also after incremental 
changes such as cell resizing and buffer insertion during physical synthesis (Sec. 
8.5). Legalization seeks to find legal, non-overlapping placements for all placeable 
modules so as to minimize any adverse impact on wirelength, timing and other 
design objectives. Unlike algorithms for "cell spreading" during global placement 
(Sec. 4.3), legalization typically assumes that the cells are distributed fairly well
throughout the layout region and have relatively small mutual overlap. Once a legal
placement is available, it can be improved with respect to a given objective by
means of detailed placement techniques, such as swapping neighboring cells to
reduce total wirelength, or sliding cells to one side of the row when unused space is
available. Some detailed placers target routability, given that route topologies can be
determined once a legal placement is available.

¹ simPL uses FastPlace-DP [4.23] for both legalization and detailed placement.




[Figure: legal positions of INV, NAND, NOR cells between VDD and GND rails]

Fig. 4.11 An instance of detailed placement. Between each VDD and GND rail, cells must be placed
in non-overlapping site-aligned locations.

One of the simplest and fastest techniques for legalization is Tetris by D. Hill [4.10].
It sorts cells by x-coordinate and processes them greedily. Each cell is placed at the 
closest available legal location that does not exceed the row capacity. In its purest 
form, the Tetris algorithm has several known drawbacks, one being its ignorance of 
the netlist. Another drawback is that in the presence of large amounts of whitespace, 
the Tetris algorithm can place cells onto one side of the layout, and thus 
considerably increase wirelength or circuit delay if some I/O pads are fixed on the 
opposite side of the layout. 
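
A greedy legalizer in the spirit of this description might look as follows. This is a simplified sketch, not Hill's implementation: every row is filled strictly left to right, the displacement is measured as Manhattan distance, and the function name `tetris_legalize` and its parameters are assumptions.

```python
def tetris_legalize(cells, row_ys, row_x_lo, row_x_hi, site_w=1):
    """cells: dict name -> (x, y, width in sites). Returns name -> (x, y)."""
    front = {ry: row_x_lo for ry in row_ys}      # leftmost free x in each row
    legal = {}
    # process cells greedily in order of increasing x-coordinate
    for name, (x, y, w) in sorted(cells.items(), key=lambda kv: kv[1][0]):
        best = None
        snapped = row_x_lo + round((x - row_x_lo) / site_w) * site_w
        for ry in row_ys:
            px = max(front[ry], snapped)         # never overlap placed cells
            if px + w * site_w > row_x_hi:       # would exceed row capacity
                continue
            disp = abs(px - x) + abs(ry - y)
            if best is None or disp < best[0]:
                best = (disp, px, ry)
        if best is None:
            raise ValueError("no legal site left for " + name)
        _, px, ry = best
        legal[name] = (px, ry)
        front[ry] = px + w * site_w
    return legal


cells = {"a": (0.2, 0.1, 2), "b": (0.9, 0.2, 2), "c": (1.1, 0.9, 1)}
legal = tetris_legalize(cells, row_ys=[0, 1], row_x_lo=0, row_x_hi=4)
```

Because the netlist is never consulted, cell c ends up in row 0 even though its global position was closer to row 1, which illustrates the netlist-ignorance drawback noted above.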

Several variations of the Tetris algorithm exist. One variant subdivides the layout 
into regions and performs Tetris legalization in each region: this prevents the large 
displacements observed with the original version. Other variants try to account for 
wirelength while finding the best legal location for a given cell [4.10]. There are also 
versions that inject unused space between legalized cells, so as to avoid routing 
congestion [4.21]. 



Some algorithms for legalization and placement are co-developed with global 
placement algorithms. For instance, in the context of min-cut placement, detailed 
placement can be performed by optimal partitioners and end-case placers [4.1][4.3] 
invoked in very small (end-case) bins that are produced after the netlist is repeatedly 
partitioned. Given that these end-case bins contain a very small number of cells 
(four to six), optimal locations can be found by exhaustive enumeration or 
branch-and-bound. For larger bins, partitioning can be performed optimally (with up 
to ~35 cells in a bin). Some analytic algorithms perform legalization in iterations 
[4.32]. At each iteration, cells closest to legal sites are identified and snapped to 
legal sites, then considered fixed thereafter. After a round of analytic placement,
another group of cells is snapped to legal sites, and the process continues until all
cells have been given legal locations. 

A common problem with simple and fast legalization algorithms is that some cells 
may travel a long distance, thus significantly increasing the length and, hence, delay 
of incident nets. This phenomenon can be mitigated by detailed placement. For 
example, optimal branch-and-bound placers [4.3] can reorder groups of neighboring 
cells in a row. Such groups of cells are often located in a sliding window; the 
optimal placer reorders cells in a given window so as to improve total wirelength 
(accounting for connections to cells with fixed locations outside the window). 
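
For very small windows, the best ordering can be found even by plain exhaustive enumeration, which branch-and-bound accelerates by pruning. The sketch below uses enumeration only, and assumes a single row, one pin per cell located at the cell center, and wirelength measured as one-dimensional half-perimeter length. The function and parameter names are illustrative and are not taken from [4.3].

```python
from itertools import permutations


def hpwl_1d(nets, xpos):
    """Sum over all nets of (max x - min x) of the connected pins."""
    return sum(max(xpos[p] for p in net) - min(xpos[p] for p in net)
               for net in nets)


def reorder_window(window, widths, left_edge, fixed_x, nets):
    """Try every ordering of the cells in one window and keep the best.

    window    : list of movable cell names
    widths    : {cell name: width}
    left_edge : x-coordinate where the window starts
    fixed_x   : {name: x} for cells or pads outside the window
    nets      : list of lists of names
    """
    best_order, best_len = None, float("inf")
    for order in permutations(window):
        xpos = dict(fixed_x)
        x = left_edge
        for c in order:
            xpos[c] = x + widths[c] / 2  # pin assumed at the cell center
            x += widths[c]
        length = hpwl_1d(nets, xpos)
        if length < best_len:
            best_order, best_len = list(order), length
    return best_order, best_len
```

The number of orderings grows factorially with the window size, which is why such optimal reordering is limited to a handful of cells per window.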

A more scalable optimization splits the cells in a given window into left and right 
halves, and optimally interleaves the two groups while preserving the relative order 
of cells from each group [4.12]. Up to 20 cells per window can be interleaved 
efficiently during detailed placement, whereas branch-and-bound placement can 
typically handle only up to eight cells [4.3]. These two optimizations can be 
combined for greater impact. 

Sometimes, wirelength can be improved by reordering cells that are not adjacent. 
For example, pairs of non-adjacent cells connected by a net can be swapped [4.23], 
and sets of three such cells can be cycled. Yet another detailed placement 
optimization is possible when unused space is available between cells placed in a 
row. These cells can be shifted to either side, or to intermediate locations. Optimal 
locations to minimize wirelength can be found by a polynomial-time algorithm 
[4.15], which is practical in many applications.

Software implementations of legalization and detailed placement are often bundled, 
but are sometimes independent of global placement. One example is FastPlace-DP 
[4.23] (binary available from the authors). FastPlace-DP works best when the input 
placement is almost legal or requires only a small number of local changes. 
FastPlace-DP performs a series of simple but efficient incremental optimizations 
which typically decrease interconnect length by several percent. On the other end of 
the spectrum is ECO-System [4.25]. It is integrated with the Capo placer [4.2][4.26] 
and uses more sophisticated yet slower optimizations. ECO-System first analyzes a 
given placement and identifies regions where cells overlap so much that they need to 
be re-placed. The Capo algorithm is then applied simultaneously to each region so as 
to ensure consistency. Capo integrates legalization and detailed placement into 
global min-cut placement. Therefore, ECO-System will produce a legal placement 
even if the initial placement requires significant changes. 

Other strategies, such as the use of linear programming [4.7] and dynamic 
programming [4.14], have been integrated into legalization and detailed placement 
with promising results. The legalization of mixed-size netlists that contain large 
movable blocks is particularly challenging [4.14].

Exercise 1: Estimating Total Wirelength

Consider the five-pin net with pins a-e (right). Each grid 
edge has unit length. 

(a) Draw a rectilinear minimum-length chain, a rectilinear
minimum spanning tree (RMST), and a rectilinear Steiner 
minimum tree (RSMT) to connect all pins. 

(b) Find the weighted total wirelength using each estimation 
technique from (a) if each grid edge has weight = 2.

Exercise 2: Min-Cut Placement

Perform min-cut placement to place gates a-g on a 2 x 4 
grid. Use the Kernighan-Lin algorithm for partitioning. 
Use alternating (horizontal and vertical) cutlines. The 
cutline cut_1 represents the initial vertical cut. Each edge on
the grid has capacity σ_P(e) = 2. Estimate whether the
placement is routable. 




Exercise 3: Force-Directed Placement 

A circuit with two gates a and b and three I/O pads In1 (0,2), In2 (0,0) and Out (2,1)
is given (left). The weights of the connections are shown below. Calculate the ZFT
positions of the two gates. Place the circuit on a 3 x 3 grid (right).

Exercise 4: Global and Detailed Placement

What are the main differences between global and detailed placement? Explain why 
the global and detailed placement steps are performed separately. Explain why 
detailed placement follows global placement.

5 Global Routing



During global routing, pins with the same electric potential are connected using wire 
segments. Specifically, after placement (Chap. 4), the layout area is represented as 
routing regions (Sec. 5.4) and all nets in the netlist are routed in a systematic manner 
(Sec. 5.5). To minimize total routed length, or optimize other objectives (Sec. 5.3), 
the route of each net should be short (Sec. 5.6). However, these routes often compete 
for the same set of limited resources. Such conflicts can be resolved by concurrent
routing of all nets (Sec. 5.7), e.g., integer linear programming (ILP), or by 
sequential routing techniques, e.g., rip-up and reroute. Several algorithmic 
techniques enable scalability of modern global routers (Sec. 5.8). 



5.1 Introduction



A net is a set of two or more pins that have the same electric potential. In the final 
chip design, they must be connected. A typical p-pin net connects one output pin of
a gate and p - 1 input pins of other gates; its fanout is equal to p - 1. The term netlist
refers collectively to all nets. 

Given a placement and a netlist, determine the necessary wiring, e.g., net topologies 
and specific routing segments, to connect these cells while respecting constraints, 
e.g., design rules and routing resource capacities, and optimizing routing objectives, 
e.g., minimizing total wirelength and maximizing timing slack. 

In area-limited designs, standard cells may be packed densely without unused space. 
This often leads to routing congestion, where the shortest routes of several nets are 
incompatible because they traverse the same tracks. Congestion forces some routes 
to detour; thus, in congested regions, it can be difficult to predict the eventual length 
of wire segments. However, the total wirelength cannot exceed the available routing 
resources, and in some cases the chip area must be increased to ensure successful 
routing. Fixed-die routing, where the chip outline and all routing resources are fixed, 
is distinguished from variable-die routing, where new routing tracks can be added as 
needed. For the fixed-die routing problem,¹ 100% routing completion is not always
possible a priori, but may be possible after changes to placement. On the other hand, 
in older standard-cell circuits with two or three metal layers, new tracks can be 
inserted as needed, resulting in the classical variable-die channel routing problem for 
which 100% routing completion is always possible. Fig. 5.1 outlines the major 
categories of routing algorithms discussed in this book. 



¹ The fixed-die routing problem is so named because the layout bounding box and the number of
routing tracks are predetermined due to the fixed floorplan and power-ground distribution.



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6_5, © Springer Science+Business Media B.V. 2011

Fig. 5.1 Routing problem types and the chapters in which they are discussed.

With the scale of modern designs at millions of nets, global routing has become a
major computational challenge. Full-chip routing is usually performed in three steps: 
(high-level) global routing, (low-level) detailed routing, and timing-driven routing. 
The first two steps are illustrated in Fig. 5.2; the last step is discussed in Sec. 8.4.

Fig. 5.2 Graphical representation of nets N1-N3 that are (coarsely) globally routed using routing
regions (left), and then (finely) detailed routed using rectilinear paths (right). This example assumes 
two-layer routing, with horizontal and vertical segments routed on separate layers. 

During global routing, the wire segments used by net topologies are tentatively 
assigned (embedded) within the chip layout. The chip area is represented by a coarse 
routing grid, and available routing resources are represented by edges with 
capacities in a grid graph. Nets are then assigned to these routing resources. 



During detailed routing, the wire segments are assigned to specific routing tracks. 
This process involves a number of intermediate tasks and decisions such as net 
ordering, i.e., which nets should be routed first, and pin ordering, i.e., within a net, 
in what order should the pins be connected. These two issues are the major 
challenges in sequential routing, where nets are routed one at a time. The net and pin 
orderings can have dramatic impacts on final solution quality. Detailed routing seeks 
to refine global routes and typically does not alter the configuration of nets 
determined by global routing. Hence, if the global routing solution is poor, the 
quality of the detailed routing solution will likewise suffer.

To determine the net ordering, each net is given a numerical indicator of importance
(priority), known as a net weight. High priority can be given to nets that are 
timing-critical, connect to numerous pins, or carry specific functions such as 
delivering clock signals. High-priority nets should avoid unnecessary detours, even 
at the cost of detouring other nets. Pin ordering is typically performed using either
tree-based algorithms (Sec. 5.6.1) or geometric criteria based on pin locations.

Specializing routing into global and detailed stages is common for digital circuits. 
For analog circuits, multi-chip modules (MCMs), and printed circuit boards (PCBs), 
global routing is sometimes unnecessary due to the smaller number of nets involved, 
and only detailed routing is performed. 



5.2 Terminology and Definitions



The following terms are relevant to global routing in general. Terms pertaining to 
specific algorithms and techniques will be introduced in their respective sections. 

A routing track (column) is an available horizontal (vertical) wiring path. A signal
net often uses a sequence of alternating horizontal tracks and vertical columns, 
where adjacent tracks and columns are connected by inter-layer vias. 

A routing region is a region that contains routing tracks and/or columns. 

A uniform routing region is formed by evenly spaced horizontal and vertical grid 
lines that induce a uniform grid over the chip area. This grid is sometimes referred to 
as a ggrid (global grid); it is composed of unit gcells (global cells). Grid lines are 
typically spaced seven to 40 routing tracks [5.18] apart to balance the complexities 
of the chip-scale global routing and gcell-scale detailed routing problems. 

A non-uniform routing region is formed by horizontal and vertical boundaries that 
are aligned to external pin connections or macro-cell boundaries. This results in 
channels and switchboxes - routing regions that have differing sizes. During global
routing, nets are assigned to these routing regions. During detailed routing, the nets 
within each routing region are assigned to specific wiring paths. 

A channel is a rectangular routing region with pins on two opposite (usually the 
longer) sides and no pins on the other (usually the shorter) sides. There are two types 
of channels - horizontal and vertical. 

A horizontal channel is a channel with pins on the top and bottom boundaries (Fig. 5.3).

A vertical channel is a channel with pins on the left and right boundaries (Fig. 5.4).

The pins of a net are connected to the routing channel by columns, which in turn are
connected to other columns by tracks. In older, variable-die routing contexts, such as 
two-layer standard-cell routing, the channel height is flexible, i.e., its capacity can be 
adjusted to accommodate the necessary amount of wiring. However, due to the 
increased number of routing layers in modern designs, this traditional channel model 
has largely lost its relevance. Instead, over-the-cell (OTC) routing is used, as 
discussed further in Sec. 6.7. 



The channel capacity represents the number of available routing tracks or columns.
For single-layer routing, the capacity is the height h of the channel divided by the
pitch d_pitch, where d_pitch is the minimum distance between two wiring paths in the
relevant (vertical or horizontal) direction (Fig. 5.3). For multilayer routing, the
capacity σ is the sum of the capacities of all layers:

$$\sigma(Layers) = \sum_{layer \in Layers} \left\lfloor \frac{h}{d_{pitch}(layer)} \right\rfloor$$

Here, Layers is the set of all layers, and d_pitch(layer) is the routing pitch for layer.
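
As a small worked instance of the capacity computation, the following sketch sums the per-layer track counts for a channel of height h. It assumes that only whole tracks count, so each quotient is rounded down, and that the dictionary passed in lists only the layers that route in the relevant direction. The function name and the layer names are hypothetical.

```python
def channel_capacity(height, pitches):
    """Total number of tracks in a channel of the given height.

    pitches maps each routing layer to its pitch, expressed in the
    same length unit as height.
    """
    return sum(int(height // pitch) for pitch in pitches.values())
```

For example, a channel of height 10 with pitches of 2 and 3 on two layers offers 5 + 3 = 8 tracks.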

Fig. 5.3 Channel routing for a horizontal channel.

A switchbox (Fig. 5.4) is the intersection of horizontal and vertical channels. Due to 
fixed dimensions, switchbox routing exhibits less flexibility and is more difficult 
than channel routing. Note that since the entry points of wires are fixed, the problem 
is one of finding routes inside the switchbox.

Fig. 5.4 Switchbox routing between horizontal and vertical channels of a macro-cell circuit.

A 2D switchbox (Fig. 5.5) is a switchbox with terminals on four boundaries (top,
bottom, left, and right). This model is primarily used with two-layer routing, where 
interlayer connections (vias) are relatively insignificant. 

A 3D switchbox (Fig. 5.5) is a switchbox with terminals on all six boundaries (top, 
bottom, left, right, up, and down), allowing paths to travel between routing layers. 
Four of the sides are the same as those of the 2D switchbox, one goes to the layer 
above, and one goes to the layer below.

Fig. 5.5 2D and 3D switchboxes in a five-layer process. The cells are internally routed using layers
up to Metal3. 2D switchboxes typically exist on layers Metal1, Metal2 and Metal3. Layers Metal4
and Metal5 are connected by a 3D switchbox.

A T-junction occurs when a vertical channel meets a horizontal channel, e.g., in a 
macro-cell placement. In the example of Fig. 5.6, both (1) the height of the vertical 
channel and (2) the locations of the fixed pin connections for the horizontal channel 
are determined only after the vertical channel has been routed. Therefore, the 
vertical channel must be routed before the horizontal channel.

Fig. 5.6 Routing for a T-junction in a macro-cell placement.



5.3 Optimization Goals 



Global routing seeks to (1) determine whether a given placement is routable, and (2)
determine a coarse routing for all nets within available routing regions. In global 
routing, the horizontal and vertical capacities of a given routing region are, 
respectively, the maximum number of horizontal and vertical routes that can traverse 
that particular region. These upper bounds are determined by the specific 
semiconductor technology and its corresponding design rules, e.g., track pitches, as 
well as layout features. As the subsequent detailed routing will only have a fixed 
number of routing tracks, routing regions in global routing should not be 
oversubscribed. Optimization objectives addressed by routing include minimizing 
total wirelength and reducing signal delays on nets that are critical to the chip's 
overall timing. 

Full-custom design. A full-custom design is a layout that is dominated by macro 
cells and wherein routing regions are non-uniform, often with different shapes and 
heights/widths.² In this context, two initial tasks, channel definition and channel
ordering, must be performed. The channel definition problem seeks to divide the 
global routing area into appropriate routing channels and switchboxes. 

To determine the types of channels and channel ordering, the layout region is 
represented by a floorplan tree (Sec. 3.3), as illustrated in Fig. 5.7. Divisions or cut 
lines in the layout, corresponding to routing channels, can be ordered according to a 
bottom-up (e.g., post-order) traversal of the internal nodes of the floorplan tree. This
determines the routing sequence of the channels.

Fig. 5.7 Finding channels and their routing sequence using a floorplan tree. The channel ordering
from 1 to 5 is determined from a bottom-up traversal of the internal nodes. 
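
The channel ordering can be illustrated with a short sketch. It assumes a binary slicing floorplan tree in which leaves are blocks and every internal node is a cut line, that is, a channel; the Node class, the labels and the function name are hypothetical and do not reproduce the tree of Fig. 5.7.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # block name, or cut name for internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def channel_order(node, order=None):
    """Post-order traversal that records only internal nodes (channels)."""
    if order is None:
        order = []
    if node is None or (node.left is None and node.right is None):
        return order                # leaves are blocks, not channels
    channel_order(node.left, order)
    channel_order(node.right, order)
    order.append(node.label)        # a cut is listed after both subtrees
    return order
```

Each channel appears only after every channel nested inside its two subtrees, so the channels whose routing fixes the pin positions of an enclosing channel are always routed first.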



² The characteristics of a full-custom design differ among companies.

If (a part of) a floorplan is non-slicing, i.e., cannot be represented completely with
only horizontal and vertical cuts, then at least one switchbox must be used, in 
addition to channels. Once all of the regions and their respective channels or 
switchboxes have been determined, the nets can be routed. Methods for routing nets 
include Steiner tree routing and minimum spanning tree routing (Sec. 5.6.1), as well 
as shortest-path routing using Dijkstra's algorithm (Sec. 5.6.3). 

In full-custom, variable-die designs, the dimensions of routing regions are typically
not fixed. Hence, the capacity of each routing region is not known a priori. In this 
context, the standard objective is to minimize the total routed length and/or minimize 
the length of the longest timing path. On the other hand, in the context of fixed-die 
placement, where the capacities of routing regions are constrained by hard upper 
bounds, routing optimization is often performed subject to a routability constraint. 

Standard-cell design. In standard-cell designs, if the number of metal layers is 
limited, feedthrough cells must be used to route a net across multiple cell rows. 
Instantiating a feedthrough cell essentially reserves an empty vertical column for a 
net. Fig. 5.8 illustrates the use of feedthrough cells in the routing of a five-pin net.
When feedthrough cells in consecutive rows are used during the routing of a given 
net, these cells must align so that their reserved tracks align. 




Feedthrough Cells 



Fig. 5.8 The routing solution of the given net (cells in
dark gray) with three feedthrough cells (white). Designs 
with more than three metal layers and over-the-cell 
routing typically do not use feedthrough cells. 



If the cell rows and the netlist are fixed, then the number of unoccupied sites in the 
cell rows is fixed. Therefore, the number of possible feedthroughs is limited. Hence, 
standard-cell global routing seeks to (1) ensure routability of the design and (2) find
an uncongested solution that minimizes total wirelength. 



If a net is contained within a single routing region, then routing this net entails either 
channel or switchbox routing and can be achieved entirely within that region, except 
when significant detours are required due to routing congestion. However, if a net 
spans more than one routing region, then the global router must split the net into 
multiple subnets, and then assign the subnets to the routing regions. For multi-pin 
routing, rectilinear Steiner trees (Sec. 5.6.1) are commonly used (Fig. 5.9).

Fig. 5.9 Two different rectilinear Steiner tree solutions for routing of a five-pin net. The solution on
the left has the least wirelength, while the solution on the right uses the fewest vertical segments. 

The total height of a variable-die standard-cell design is the sum of all cell row 
heights (fixed quantity) plus all channel heights (variable quantities). Ignoring the 
small impact of feedthrough cells, minimizing the layout area is equivalent to 
minimizing the sum of channel heights. Thus, although the layout is primarily 
determined during placement, the global routing solution also influences the layout 
size. Shorter routes typically lead to a more compact layout. 

Gate-array designs. In gate-array designs, the sizes of the cells and the sizes of the 
routing regions between the cells (routing capacities) are fixed. Since routability 
cannot be ensured by the insertion of additional routing resources, key tasks include 
determining the placement's routability and finding a feasible routing solution. Like 
other design styles, additional optimization objectives include minimizing total 
routed wirelength and minimizing the length of the longest timing path.

Fig. 5.10 A gate-array design with channel height = 4 (left). If all nets are routed with their shortest
paths, the full netlist is unroutable. In one possible routing, net C remains unrouted (right).

5.4 Representations of Routing Regions



To model the global routing problem, the routing regions (e.g., channels or
switchboxes) are represented using efficient data structures. Typically, the routing
context is captured using a graph, where nodes represent routing regions and edges
represent adjoining regions.³ For each prescribed connection, a router must
determine a path within the graph that connects the terminal pins. The path can only
traverse nodes and edges of the graph that have sufficient remaining routing
resources. The following three graph models are commonly used.

³ Capacities are associated with both edges and nodes to represent available routing resources.

A grid graph is defined as ggrid = (V,E), where the nodes v ∈ V represent the
routing grid cells (gcells) and the edges represent connections of grid cell pairs (v_i,v_j)
(Fig. 5.11). The global routing grid graph is two-dimensional, but must represent k
routing layers. Hence, k distinct capacities must be maintained at each node of the
grid graph. For example, for a two-layer routing grid (k = 2), a capacity pair (3,1) for
a routing grid cell can represent three horizontal segments and one vertical segment 
still available. Other capacity representations are possible. 
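
One possible encoding of such a grid graph for the two-layer case is sketched below. It stores a (horizontal, vertical) capacity pair per gcell, as in the (3,1) example, and derives the edges implicitly from grid adjacency. The class and method names are hypothetical, and real routers often attach capacities to edges instead.

```python
class GridGraph:
    """2D global-routing grid with a capacity pair per gcell (k = 2)."""

    def __init__(self, cols, rows, h_cap, v_cap):
        self.cols, self.rows = cols, rows
        # capacity[(x, y)] = [remaining horizontal, remaining vertical]
        self.capacity = {(x, y): [h_cap, v_cap]
                         for x in range(cols) for y in range(rows)}

    def neighbors(self, cell):
        """Yield the gcells adjacent to `cell` (the edges of the graph)."""
        x, y = cell
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < self.cols and 0 <= ny < self.rows:
                yield (nx, ny)

    def use(self, cell, direction):
        """Reserve one segment in `cell`; direction is 'h' or 'v'."""
        idx = 0 if direction == "h" else 1
        if self.capacity[cell][idx] == 0:
            raise ValueError(f"gcell {cell} has no {direction} capacity left")
        self.capacity[cell][idx] -= 1
```

A router built on this structure would call use() for every gcell that a route traverses and treat a raised error as a signal to search for a detour.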

Fig. 5.11 A layout and its corresponding grid graph.

In a channel connectivity graph G = (V,E), the nodes v ∈ V represent channels, and
the edges E represent adjacencies of the channels (Fig. 5.12). The capacity of each
channel is represented in its respective graph node.

Fig. 5.12 Sample layout with its corresponding channel connectivity graph.



In a switchbox connectivity (channel intersection) graph G = (V,E), the nodes v ∈ V
represent switchboxes. An edge exists between two nodes if the corresponding
switchboxes are on opposite sides of the same channel (Fig. 5.13). In this graph
model, the edges represent horizontal and vertical channels.

Fig. 5.13 Sample layout with its corresponding switchbox connectivity graph.

5.5 The Global Routing Flow



Step 1: defining the routing regions. In this step, the layout area is divided into
routing regions. In some cases, nets can also be routed over standard cells (OTC 
routing). As discussed earlier, the routing regions are formed as 2D or 3D channels, 
switchboxes and other region types (Fig. 5.5). These routing regions, their capacities, 
and their connections are then represented by a graph (Sec. 5.4). 

Step 2: mapping nets to the routing regions. In this step, each net of the design is 
tentatively assigned to one or several routing regions so as to connect all of its pins. 
The routing capacity of each routing region limits the number of nets traversing this 
region. Other factors, such as timing and congestion, also affect the path chosen for 
each net. For example, routing resources can be priced differently in different 
regions with available routing capacity - the more congested a routing region, the 
higher the cost for any net subsequently routed through that region. Such resource 
pricing encourages subsequent nets to seek alternate paths, and results in a more 
uniform routing density. 
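
The effect of resource pricing can be demonstrated with a toy model. The sketch below assumes a hypothetical linear pricing rule (a base cost plus a penalty proportional to utilization, with full regions unusable) and a precomputed list of candidate paths per net; neither the rule, the constants, nor the function names come from a particular router.

```python
import math


def region_cost(usage, capacity, base=1.0, alpha=4.0):
    """Hypothetical price of a region: rises with utilization."""
    if usage >= capacity:
        return math.inf             # no routing capacity left
    return base + alpha * usage / capacity


def path_cost(path, usage, capacity):
    return sum(region_cost(usage[r], capacity[r]) for r in path)


def route_net(candidates, usage, capacity):
    """Choose the cheapest candidate path and commit its resource usage."""
    best = min(candidates, key=lambda p: path_cost(p, usage, capacity))
    if math.isinf(path_cost(best, usage, capacity)):
        raise ValueError("no candidate path has free capacity")
    for r in best:
        usage[r] += 1
    return best
```

With a direct path through one region and a detour through two others, successive nets alternate between the two once the direct region becomes partially used, which spreads the routing demand.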

Step 3: assigning crosspoints. In this step, also known as midway routing, routes 
are assigned to fixed locations, or crosspoints, along the edges of the routing regions. 
Crosspoint assignment enables scaling of global and detailed routing to designs with 
millions of cells as well as distributed and parallel algorithms, since the routing 
regions can be handled independently in detailed routing (Chap. 6). 

Finding an optimal crosspoint assignment requires knowledge of net connection 
dependencies and channel ordering. For instance, the crosspoints of a switchbox can 
be fixed only after all adjacent channels have been routed. Thus, the locations will 
depend on local connectivity through the channels and switchboxes. Note that some 
global routers do not perform a separate crosspoint assignment step. Instead, implicit 
crosspoint assignment is integrated with detailed routing.

The following techniques for single-net routing are commonly used within larger
full-chip routing tools.

5.6.1 Rectilinear Routing

Multi-pin nets - nets with more than two pins - are often decomposed into two-pin 
subnets, followed by point-to-point routing of each subnet according to some 
ordering. Such net decomposition is performed at the beginning of global routing 
and can affect the quality of the final routing solution. 

Rectilinear spanning tree. A rectilinear spanning tree connects all terminals (pins)
using only pin-to-pin connections that are composed of vertical and horizontal
segments. Pin-to-pin connections can meet only at a pin, i.e., "crossing" edges do
not intersect, and no additional junctions (Steiner points) are allowed. If the total
length of segments used to create the spanning tree is minimal, then the tree is a
rectilinear minimum spanning tree (RMST). An RMST can be computed in O(p²)
time, where p is the number of terminals in the net, using methods such as Prim's
algorithm [5.19]. This algorithm builds an MST by starting with a single terminal
and greedily adding least-cost edges to the partially-constructed tree until all
terminals are connected. Advanced computational-geometric techniques reduce the
runtime to O(p log p).
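
A minimal sketch of the quadratic-time variant is given below. It runs Prim's algorithm on the implicit complete graph whose edge weights are rectilinear (Manhattan) distances between pins; the function name and the return format are choices made for this illustration.

```python
def rmst(pins):
    """Prim's algorithm with Manhattan edge weights, O(p^2) time.

    pins: list of (x, y) tuples.
    Returns (total length, list of (i, j) index pairs forming the tree).
    """
    p = len(pins)
    if p == 0:
        return 0, []

    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    in_tree = [False] * p
    best = [float("inf")] * p   # cheapest known connection to the tree
    parent = [-1] * p
    best[0] = 0                 # start from the first terminal
    total, edges = 0, []
    for _ in range(p):
        # Pick the terminal outside the tree that is cheapest to attach.
        u = min((i for i in range(p) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Update attachment costs of the remaining terminals.
        for v in range(p):
            if not in_tree[v]:
                d = dist(pins[u], pins[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return total, edges
```

For the pins (0,0), (3,0) and (3,4), the tree consists of the edges between the first and second pin and between the second and third pin, with total length 7.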

Rectilinear Steiner tree (RST). A rectilinear Steiner tree (RST) connects all p pin 
locations and possibly some additional locations (Steiner points). While any 
rectilinear spanning tree for a p-pin net is also a rectilinear Steiner tree, the addition
of carefully-placed Steiner points often reduces the total net length.⁴ An RST is a
rectilinear Steiner minimum tree (RSMT) if the total length of net segments used to 
connect all p pins is minimal. For instance, in a uniform routing grid, let a unit net 
segment be an edge that connects two adjacent gcells; an RST is an RSMT if it has 
the minimum number of unit net segments. 

The following facts are known about RSMTs. 

- An RSMT for a p-pin net has between 0 and p - 2 (inclusive) Steiner points.

- The degree of any terminal pin is 1, 2, 3, or 4. The degree of a Steiner point is
either 3 or 4.

- An RSMT is always enclosed in the minimum bounding box (MBB) of the net.

- The total edge length L_RSMT of the RSMT is at least half the perimeter of the
minimum bounding box of the net: L_RSMT ≥ L_MBB / 2.



⁴ In Manhattan routing, the corner of an L-shape connection between two points is not considered a
Steiner point.

Constructing RSMTs in the general case is NP-hard; in practice, heuristic methods
are used. One fast technique, FLUTE, developed by C. Chu and Y. Wong [5.5], finds 
optimal RSMTs for up to nine pins and produces near-minimal RSTs, often within 
1% of the minimum length, for larger nets. Although RSMTs are optimal in terms of 
wirelength, they are not always the best choice of net topology in practice. 



Some heuristics do not deal well with obstacle avoidance or other requirements. 
Furthermore, the minimum wirelength objective may be secondary to control of 
timing or signal integrity on particular paths in the design. Therefore, nets may 
alternatively be connected using RMSTs, which can be computed in low-order 
polynomial time and are guaranteed to have no more than 1.5 times the net length of 
an RSMT. This worst-case ratio occurs, for example, with terminals at (1,0), (0,1), 
(-1,0) and (0,-1). The RSMT has a Steiner point at (0,0) and total net length = 4, 
while the RMST uses three length-2 edges with total net length = 6. Since RMSTs 
are computed optimally and quickly, many heuristic approaches, such as the one 
presented below, transform an initial RMST into a low-cost RSMT [5.10]. 
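
The worst-case instance can be checked numerically. The sketch below is tailored to this four-pin example and is not a general RSMT solver: it relies on the observation that every pin pair is at distance 2, so any spanning tree has length 6, and that the four straight connections to the origin do not overlap. The helper names are hypothetical.

```python
from itertools import combinations


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def half_perimeter(pins):
    """Half the perimeter of the minimum bounding box of the pins."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def star_length(center, pins):
    """Length of a tree joining every pin to `center`.

    Valid as a tree length only when the connections do not overlap.
    """
    return sum(manhattan(center, p) for p in pins)


pins = [(1, 0), (0, 1), (-1, 0), (0, -1)]

# All pin pairs are equally far apart, so every spanning tree is minimal.
pair_distances = {manhattan(a, b) for a, b in combinations(pins, 2)}
rmst_len = (len(pins) - 1) * min(pair_distances)

# Steiner point at the origin; the result meets the half-perimeter bound.
rsmt_len = star_length((0, 0), pins)
ratio = rmst_len / rsmt_len
```

Here rsmt_len equals the half-perimeter of the bounding box, so no shorter tree exists, and ratio evaluates to the stated bound of 1.5.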



Example: RMSTs and RSMTs

Given: rectilinear minimum spanning tree (RMST) (right).

Task: transform the RMST into a heuristic RSMT.

Solution:

The transformation relies on the fact that for each two-pin connection, two different
L-shapes may be formed. L-shapes that cause (more) overlap of net segments, and hence
reduction of the total wirelength of the net, are preferred. Construct L-shapes between
points p1 and p2.

Construct another set of L-shapes between p2 and p3. This introduces the Steiner
point S1.

No further wirelength reduction is possible. The final tree, which is an RSMT, has
Steiner point S1 and is shown on the right.

Hanan grid. The previous example showed that adding Steiner points to an RMST
can significantly reduce the wirelength of the net. In 1966, M. Hanan [5.8] proved
that in finding an RSMT, it suffices to consider only Steiner points located at the
intersections of vertical and horizontal lines that pass through terminal pins. More
formally, the Hanan grid (Fig. 5.14) consists of the lines x = x_p, y = y_p that pass
through each pin location (x_p,y_p). The Hanan grid contains at most p² - p candidate
Steiner points (p² intersections, of which p are the pins themselves), thereby greatly
reducing the solution space for finding an optimal RSMT.
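
Enumerating the candidate Steiner points takes only a few lines. The sketch below collects the distinct x- and y-coordinates of the pins and returns every grid intersection that is not itself a pin; the function name is hypothetical.

```python
def hanan_points(pins):
    """Candidate Steiner points: Hanan grid intersections that are not pins."""
    xs = sorted({x for x, _ in pins})
    ys = sorted({y for _, y in pins})
    pin_set = set(pins)
    return [(x, y) for x in xs for y in ys if (x, y) not in pin_set]
```

For the three pins (0,0), (2,1) and (1,3), the grid has nine intersections, of which six remain as candidates. Fewer candidates result when pins share an x- or y-coordinate.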
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Fig. 5.14 Finding tlie Hanan grid and subsequently tiie Steiner points of an RSMT. 
Defining the routing regions. For global routing with Steiner trees, the layout is usually divided into a coarse routing grid consisting of gcells, and represented by a graph (Sec. 5.4). The example in Fig. 5.15 assumes a standard-cell layout. Without explicitly defining channel height, the distance between the horizontal lines is used as an estimate. Based on this estimate, the distances between the vertical lines are chosen such that each grid cell has a 1:1 aspect ratio. All connection points within a grid cell are treated as if they are at the cell's midpoint.

Fig. 5.15 In the global layout of standard cells (left), standard cells (dark gray) need to be connected at their pins (center). These pins are assigned to grid cells for Steiner tree construction (right).

Steiner tree construction heuristic. The following heuristic uses the Hanan grid to 
produce a close-to-minimal rectilinear Steiner tree. The heuristic greedily makes the 
shortest possible new connection consistent with a tree topology (recall Prim's 
minimum spanning tree algorithm), leaving as much flexibility as possible in the 
embedding of edges. For nets with up to four pins, this heuristic always produces 
RSMTs. Otherwise the resulting RSTs are not necessarily minimal. 

The heuristic (pseudocode below) works as follows. Let P' = P be the set of pins to consider (line 1). Find the closest (in terms of rectilinear distance) pin pair (p_A,p_B) in P' (line 2), add them to the tree T, and remove them from P' (lines 3-6). If there are only two pins, then the shortest path connecting them is an L-shape (lines 7-8). Otherwise, construct the minimum bounding box (MBB) between p_A and p_B (line 10), and find the closest point pair (p_MBB,p_C) between MBB(p_A,p_B) and P', where p_MBB is the point located on MBB(p_A,p_B), and p_C is the pin in P' (line 12). If p_MBB is a pin, then add any L-shape to T (lines 16-17). Otherwise, add the L-shape that p_MBB lies on (lines 18-19). Construct the MBB of p_MBB and p_C (line 20), and repeat this process until P' is empty (lines 11-20). Finally, add the remaining (unconnected) pin to T (lines 21-22).
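The closest-point-pair step between an MBB and the remaining pins can be illustrated as follows. This is a sketch under stated assumptions, not the authors' implementation: the MBB is treated as a filled axis-parallel rectangle, so the closest point to an outside pin is found by clamping the pin's coordinates to the box; the names `mbb`, `closest_point_on_mbb` and `closest_pair` are hypothetical.

```python
def mbb(p, q):
    """Minimum bounding box of two points as (x_lo, y_lo, x_hi, y_hi)."""
    return (min(p[0], q[0]), min(p[1], q[1]), max(p[0], q[0]), max(p[1], q[1]))

def closest_point_on_mbb(box, pin):
    """Point of the box with minimum rectilinear distance to pin,
    obtained by clamping each coordinate to the box's range."""
    x_lo, y_lo, x_hi, y_hi = box
    return (min(max(pin[0], x_lo), x_hi), min(max(pin[1], y_lo), y_hi))

def closest_pair(box, pins):
    """Return (distance, p_mbb, p_c) for the pin closest to the box.
    Ties are broken in favor of the pin listed first."""
    best = None
    for pin in pins:
        point = closest_point_on_mbb(box, pin)
        dist = abs(point[0] - pin[0]) + abs(point[1] - pin[1])
        if best is None or dist < best[0]:
            best = (dist, point, pin)
    return best

# Pins p_1(0,6) and p_2(1,5) form the first MBB; the remaining pins are tested.
remaining = [(4, 7), (5, 4), (6, 2), (3, 2), (1, 0)]
print(closest_pair(mbb((0, 6), (1, 5)), remaining))   # (4, (1, 6), (4, 7))
```

Here the closest point (1,6) is a corner of the MBB that is not a pin, which corresponds to the case handled in lines 18-19 of the heuristic.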
Sequential Steiner Tree Heuristic

Input: set of all pins P
Output: heuristic Steiner minimum tree T(V,E)

1.  P' = P
2.  (p_A,p_B) = CLOSEST_PAIR(P')                       // closest pin pair
3.  ADD(V,p_A)                                         // add p_A to T
4.  ADD(V,p_B)                                         // add p_B to T
5.  REMOVE(P',p_A)                                     // remove p_A from P'
6.  REMOVE(P',p_B)                                     // remove p_B from P'
7.  if (P' == Ø)                                       // shortest path connecting
8.    ADD(E,L-shape connecting p_A and p_B)            // p_A and p_B is any L-shape
9.  else
10.   curr_MBB = MBB(p_A,p_B)                          // MBB of p_A and p_B
11.   while (P' != Ø)
12.     (p_MBB,p_C) = CLOSEST_PAIR(curr_MBB,P')        // closest point pair, one from
                                                       // curr_MBB, one from P'
13.     ADD(V,p_MBB)                                   // add p_MBB to T
14.     ADD(V,p_C)                                     // add p_C to T
15.     REMOVE(P',p_C)                                 // remove p_C from P'
16.     if (p_MBB ∈ P)                                 // if p_MBB is a pin, either
17.       ADD(E,L-shape connecting p_MBB and p_C)      // L-shape is shortest path
18.     else                                           // if p_MBB is not a pin, add
19.       ADD(E,L-shape that includes p_MBB)           // L-shape that p_MBB is on
20.     curr_MBB = MBB(p_MBB,p_C)                      // MBB of p_MBB and p_C
21.   ADD(V,p_C)                                       // connect T to remaining pin
22.   ADD(E,L-shape connecting p_MBB and p_C)          // with L-shape



Example: Sequential Steiner Tree Heuristic

Given: seven pins p_1-p_7 and their coordinates (right).

p_1 (0,6)   p_2 (1,5)   p_3 (4,7)   p_4 (5,4)   p_5 (6,2)   p_6 (3,2)   p_7 (1,0)

Task: construct a heuristic Steiner minimum tree using the sequential Steiner tree heuristic.

Solution:

P' = {p_1,p_2,p_3,p_4,p_5,p_6,p_7}

Pins p_1 and p_2 are the closest pair of pins. Remove them from P'.

P' = {p_3,p_4,p_5,p_6,p_7}

Construct the MBB of p_1 and p_2, and find the closest point pair between MBB(p_1,p_2) and P', which are respectively p_MBB = p_a and p_C = p_3. Since p_a is not a pin, select the L-shape that p_a lies on. Construct the MBB of p_a and p_3. Remove p_3 from P'.

P' = {p_4,p_5,p_6,p_7}

Find the closest point pair between MBB(p_a,p_3) and P', which are respectively p_MBB = p_b and p_C = p_4. Since p_b is not a pin, select the L-shape that p_b lies on. Construct the MBB of p_b and p_4. Remove p_4 from P'.

P' = {p_5,p_6,p_7}

Find the closest point pair between MBB(p_b,p_4) and P', which are respectively p_MBB = p_4 and p_C = p_5. Since p_4 is a pin, select either L-shape that connects p_b and p_4. In this example, the right-down L-shape is selected. Construct the MBB of p_4 and p_5. Remove p_5 from P'.

P' = {p_6,p_7}

Find the closest point pair between MBB(p_4,p_5) and P', which are respectively p_MBB = p_c and p_C = p_6. Since p_c is not a pin, select the L-shape that p_c lies on. Construct the MBB of p_c and p_6. Remove p_6 from P'.

P' = {p_7}

Find the closest point pair between MBB(p_c,p_6) and P', which are respectively p_MBB = p_6 and p_C = p_7. Since p_6 is a pin, select either L-shape that connects p_c and p_6. In this example, there is one shortest path between p_c and p_6. Construct the MBB of p_6 and p_7. Remove p_7 from P'.

P' = Ø

Since there is only one pin unconnected (p_7), select either L-shape that connects p_6 and p_7. In this example, the down-left L-shape is selected.
Mapping nets to routing regions. After finding an appropriate Steiner tree topology for the net, the segments are mapped to the physical layout. Each net segment is assigned to a specific grid cell (Fig. 5.16). When making these assignments, horizontal and vertical capacities for each grid cell are also considered.

Fig. 5.16 Assignment of each segment of the Steiner tree generated for the net from Fig. 5.15 (left and center). The routing regions affected are shaded in light gray (right).

► 5.6.2 Global Routing in a Connectivity Graph

Several global routing algorithms on connectivity graphs are based on the channel 
model introduced by Rothermel and Mlynski in 1983 [5.20]. This model combines 
switchboxes and channels (Sec. 5.4) and handles non-rectangular block shapes. It is 
suitable for full-custom design and multi-chip modules. Global routing in a 
connectivity graph is performed using the following sequence of steps. 

Global Routing in a Connectivity Graph

Input: netlist Netlist, layout LA
Output: routing topologies for each net in Netlist

1.  RR = DEFINE_ROUTING_REGIONS(LA)          // define routing regions
2.  CG = DEFINE_CONNECTIVITY_GRAPH(RR)       // define connectivity graph
3.  nets = NET_ORDERING(Netlist)             // determine net ordering
4.  ASSIGN_TRACKS(RR,Netlist)                // assign tracks for all pin
                                             // connections in Netlist
5.  for (i = 1 to |nets|)                    // consider each net
6.    net = nets[i]
7.    FREE_TRACKS(net)                       // free corresponding tracks
                                             // for net's pins
8.    snets = SUBNETS(net)                   // decompose net into
                                             // two-pin subnets
9.    for (j = 1 to |snets|)
10.     snet = snets[j]
11.     spath = SHORTEST_PATH(snet,CG)       // find shortest path for snet
                                             // in connectivity graph CG
12.     if (spath == Ø)                      // if no shortest path exists,
13.       continue                           // do not route
14.     else                                 // otherwise, assign snet to
15.       ROUTE(snet,spath,CG)               // the nodes of spath and
                                             // update routing capacities



Defining the routing regions. The vertical and horizontal routing regions are 
formed by stretching the bounding box of cells in each direction until a cell or chip 
boundary is reached (Fig. 5.17). 

































Fig. 5.17 Given a layout area with macro blocks (left), the horizontal (center) and vertical (right) macro-cell edges are extended to form the routing regions.

Defining the connectivity graph. In the connectivity graph representation (Fig. 5.18), the nodes represent the routing regions, and an edge between two nodes indicates that those routing regions are connected (continuity of the routing region). Each node also maintains the horizontal and vertical capacities for its routing region.

Fig. 5.18 Routing regions represented by a connectivity graph. Each node maintains the horizontal and vertical routing capacities for its corresponding routing region. The edges between the nodes show the connectivity between routing regions.

Determining the net order. The order in which nets are processed can be 
determined before or during routing. Nets can be prioritized according to criticality, 
number of pins, size of bounding box (larger means higher priority), or electrical 
properties. Some algorithms dynamically update priorities based on layout 
characteristics observed during the course of routing attempts. 

Assigning tracks for all pin connections. For each pin pin, a horizontal track and a vertical track are reserved within pin's routing region. This step is necessary to ensure that pin can be connected. It also provides two major advantages. First, since track assignment of pin connections is performed before global routing, inaccessible pins are detected early, in which case the placement of cells must be adjusted. Second, track reservation prevents nets that are routed first from blocking pin connections that are used later. This gives the router a more accurate congestion map and enables more intelligent detouring for future nets [5.1].

Global routing of all nets. Each net is processed separately in a pre-determined 
order. The following steps are applied to each net. 



Net and/or subnet ordering: split the multi-pin net into two-pin subnets, then 
determine an appropriate order by sorting the pins of each subnet, e.g., in 
non-decreasing order with respect to the x-coordinates. 






Track assignment in the connectivity graph: tracks are assigned using a 
maze-routing algorithm (Sec. 5.6.3), with the regions' remaining resources as 
weights. High congestion in a region will encourage paths to detour through regions 
with low congestion. 

Capacity update in the connectivity graph: after a route has been found, the 
capacities for each region are appropriately decremented at each corresponding node. 
Note that the horizontal and vertical capacities are treated separately. That is, a 
vertical (horizontal) wiring path does not affect the horizontal (vertical) capacity in 
the same region. 
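The capacity update can be sketched as follows. This is a minimal illustration, not the authors' data structure: the dictionary layout, the "H"/"V" direction labels and the function name `commit_route` are assumptions. The point is only that horizontal and vertical capacities of a region are tracked and decremented independently.

```python
from collections import Counter

def commit_route(capacity, usage):
    """Try to commit a route to the connectivity graph.

    capacity: {region: {"H": int, "V": int}} remaining capacities per node
    usage:    list of (region, direction) pairs consumed by the route
    Returns True and decrements capacities if the route fits,
    otherwise returns False and leaves capacities unchanged."""
    demand = Counter(usage)
    if any(capacity[region][direction] < n
           for (region, direction), n in demand.items()):
        return False
    for (region, direction), n in demand.items():
        capacity[region][direction] -= n
    return True

capacity = {4: {"H": 1, "V": 1}, 5: {"H": 1, "V": 1}}
print(commit_route(capacity, [(4, "H"), (5, "H")]))  # True: horizontal tracks used
print(commit_route(capacity, [(4, "H")]))            # False: horizontal capacity is 0
print(commit_route(capacity, [(4, "V")]))            # True: vertical capacity untouched
```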

Example: Global Routing in a Connectivity Graph 

Given: (1) nets A and B, (2) the layout region with routing regions and obstacles (left), and (3) the corresponding connectivity graph (right).

Task: route A and B using as few resources as possible.

Solution:

Route A before B. Note: track assignment of pin connections is omitted in this example.

After A is routed, route B. Note that B is detoured since the horizontal capacities of nodes (regions) 5 and 6 are 0.

Example: Determining Routability

Given: (1) nets A and B, (2) the layout area with routing regions and obstacles (left), and (3) the corresponding connectivity graph (right).

Task: determine the routability of A and B.

Solution:

Net A is first routed through nodes (regions) 4-5-6-7-10, which is a shortest path. After this assignment, the horizontal capacities of nodes 4-5-6-7 are exhausted, i.e., each horizontal capacity = 0.

The shortest path for net B was previously through nodes 4-5-6, but this path would now make these nodes' horizontal capacities negative. A longer (but feasible) path exists through nodes 4-8-9-5-1-2-6. Thus, this particular placement is considered routable.

► 5.6.3 Finding Shortest Paths with Dijkstra's Algorithm

Dijkstra's algorithm [5.6] finds shortest paths from a specified source node to all other nodes in a graph with non-negative edge weights. This is often referred to as maze routing. The algorithm is useful for finding a shortest path between two specific nodes in the routing graph - i.e., from a source to a particular target. In this context, the algorithm terminates when a shortest path to a target node is found.

Dijkstra's algorithm takes as input (1) a graph G(V,E) with non-negative edge weights W, (2) a source (starting) node s, and (3) a target (ending) node t. The algorithm maintains three groups of nodes - (1) unvisited, (2) considered, and (3) known. Group 1 contains the nodes that have not yet been visited. Group 2 contains the nodes that have been visited but for which the shortest-path cost from the starting node has not yet been found. Group 3 contains the nodes that have been visited and for which the shortest-path cost from the starting node has been found.

Dijkstra's Algorithm

Input: weighted graph G(V,E) with edge weights W, source node s, target node t
Output: shortest path path from s to t

1.  group1 = V                                   // initialize groups 1, 2 and 3
2.  group2 = group3 = path = Ø
3.  foreach (node node ∈ group1)
4.    parent[node] = UNKNOWN                     // parent of node is unknown, initial
5.    cost[s][node] = ∞                          // cost from s to any node is maximum
6.  cost[s][s] = 0                               // except the s-s cost, which is 0
7.  curr_node = s                                // s is the starting node
8.  MOVE(s,group1,group3)                        // move s from Group 1 to Group 3
9.  while (curr_node != t)                       // while not at target node t
10.   foreach (neighboring node node of curr_node)
11.     if (node ∈ group3)                       // shortest path is already known
12.       continue
13.     trial_cost = cost[s][curr_node] + W[curr_node][node]
14.     if (node ∈ group1)                       // node has not been visited
15.       MOVE(node,group1,group2)               // mark as visited
16.       cost[s][node] = trial_cost             // set cost from s to node
17.       parent[node] = curr_node               // set parent of node
18.     else if (trial_cost < cost[s][node])     // node has been visited and
                                                 // new cost from s to node is lower
19.       cost[s][node] = trial_cost             // update cost from s to node and
20.       parent[node] = curr_node               // parent of node
21.   curr_node = BEST(group2)                   // find lowest-cost node in Group 2
22.   MOVE(curr_node,group2,group3)              // and move to Group 3
23. while (curr_node != s)                       // backtrace from t to s
24.   ADD(path,curr_node)                        // add curr_node to path
25.   curr_node = parent[curr_node]              // set next node as parent of curr_node

First, all nodes are moved into Group 1 (line 1), while Groups 2 and 3 are empty (line 2). For each node node in Group 1, the cost of reaching node from the source node s is initialized to be infinite (∞), i.e., unknown, with the exception of s itself, which has cost 0. The parents of each node node in Group 1 are initially unknown since no node has yet been visited (lines 3-6). The source node s is set as the current node curr_node (line 7), which is then moved into Group 3 (line 8). While the target node t has not been reached (line 9), the algorithm computes the cost of reaching each neighboring node node of curr_node from s. The cost of reaching node from s is defined as the cost of reaching curr_node from s plus the cost of the edge from curr_node to node (lines 10-13). The former value is maintained by the algorithm, while the latter value is defined by edge weights.

If the neighboring node node's shortest-path cost has already been computed, then move on to another neighboring node (lines 11-12). If node has not been visited, i.e., is in Group 1, the cost of reaching node from s is recorded and the parent of node is set to curr_node (lines 14-17). Otherwise, the current cost from s to node is compared to the new cost trial_cost. If trial_cost is lower than the current cost from s to node, then the cost and parent of node are updated (lines 18-20). After all neighbors have been considered, the algorithm selects the best or lowest-cost node and sets it to be the new curr_node (lines 21-22).

Once t is found, the algorithm finds the shortest path by backtracing. Starting with t, the algorithm visits the parent of t, visits the parent of that node, and so on, until s is reached (lines 23-25).

After the first iteration, Group 3 contains a non-source node that is closest (with respect to cost) to s. After the second iteration, Group 3 contains the node that has second-lowest shortest-path cost from s, and so on. Whenever a new node node is added to Group 3, the shortest path from s to node must pass only through nodes that are already in Group 3. This invariant guarantees a shortest-path cost.

The key to the efficiency of Dijkstra's algorithm is the small number of cost updates. During each iteration, only the neighbors of the node that was most recently added to Group 3 (curr_node) need to be considered. Nodes adjacent to curr_node are added to Group 2 if they are not already in Group 2. Then, the node in Group 2 with minimum shortest-path cost is moved to Group 3. Once the target node, i.e., the stopping point, has been added to Group 3, the algorithm traces back from the target node to the starting node to find the optimal (minimum-cost) path.

Dijkstra's algorithm not only guarantees a shortest (optimal) path based on the given 
non-negative edge weights, but can also optimize a variety of objectives, as long as 
they can be represented by edge weights. Examples of these objectives include 
geometric distance, electrical properties, routing congestion, and wire densities. 
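A compact way to realize the three groups is a priority queue. The following is a minimal sketch rather than a line-by-line transcription of the pseudocode: Group 2 is held in a binary heap (Python's `heapq`) with stale entries skipped, Group 3 is the set `known`, and Group 1 is implicit. The sample graph is an assumption: its edge weights are inferred from the cost labels of the worked example for nodes a-h, with each (w1,w2) pair summed into one number; node i is omitted because its edges cannot be inferred from those labels.

```python
import heapq

def dijkstra(graph, s, t):
    """Shortest path from s to t in graph = {node: {neighbor: weight}}.
    Returns (path, cost), or (None, inf) if t is unreachable."""
    cost = {s: 0}
    parent = {}
    known = set()              # Group 3: shortest-path cost is final
    frontier = [(0, s)]        # Group 2: visited, cost not yet final
    while frontier:
        c, node = heapq.heappop(frontier)
        if node in known:
            continue           # stale heap entry, a cheaper one was processed
        known.add(node)
        if node == t:
            break              # target reached, stop early
        for nbr, w in graph[node].items():
            trial = c + w
            if nbr not in known and trial < cost.get(nbr, float("inf")):
                cost[nbr] = trial
                parent[nbr] = node
                heapq.heappush(frontier, (trial, nbr))
    if t not in known:
        return None, float("inf")
    path = [t]                 # backtrace from t to s
    while path[-1] != s:
        path.append(parent[path[-1]])
    return path[::-1], cost[t]

# Undirected sample graph (weights are w1 + w2, inferred as described above).
edges = [("a", "b", 14), ("a", "d", 5), ("b", "c", 5), ("b", "e", 8),
         ("c", "f", 17), ("d", "e", 16), ("d", "g", 16), ("e", "f", 10),
         ("e", "h", 10), ("g", "h", 5)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

print(dijkstra(graph, "a", "h"))   # (['a', 'd', 'g', 'h'], 26)
```

Unlike the pseudocode, this sketch includes the source node in the returned path.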



Example: Dijkstra 's Algorithm 

Given: graph with nine nodes a-i and edge weights (w_1,w_2) (right).

Task: find the shortest path using Dijkstra's algorithm from source s (node a) to target t (node h), where the path cost from node A to node B is

$cost[A][B] = \sum w_1(A,B) + \sum w_2(A,B)$

Solution:

Only Groups 2 and 3 are listed, as Group 1 is directly encoded in the graph. The parent[node] of each node node is its predecessor in the shortest path from s. Each node, with the exception of s, has the following format.

<parent of node> [node name] ($\sum w_1(s,node)$, $\sum w_2(s,node)$)

For example, <a> [b] (8,6) means that node a is the parent of node b, and has costs (8,6).






Iteration 1: curr_node = a

Add the starting node s = a to Group 3.

Find the cumulative path costs through the current node a to all its neighboring nodes (b and d). Add each neighboring node to Group 2, keeping track of its costs and parent.

b: cost[s][b] = 8 + 6 = 14, parent[b] = a
d: cost[s][d] = 1 + 4 = 5, parent[d] = a

Between b and d, d has the lower cost. Therefore, it is selected as the node to be moved from Group 2 to Group 3.

Group 2                 Group 3
<a> [b] (8,6)
<a> [d] (1,4)           <a> [d] (1,4)

Iteration 2: curr_node = d

Compute the costs of all nodes that are adjacent to d but not in Group 3. If the neighboring node node is unvisited (node is in Group 1), then add it to Group 2 and set its costs and parent. Otherwise (node is in Group 2), update its costs and parent if the new cost is less than the existing cost. The cost to node is defined as the minimum of (1) the existing path cost and (2) the path cost from s to d plus the edge cost from d to node. From Group 2, select the node with the least cost and move it to Group 3.

Group 2                 Group 3
<a> [b] (8,6)
<a> [d] (1,4)           <a> [d] (1,4)
<d> [e] (10,11)
<d> [g] (9,12)          <a> [b] (8,6)

Iterations 3-6: Similar to Iteration 2. In Iteration 3, the entry <b> [e] (10,12) is rejected because the previous entry <d> [e] (10,11) is less. The end result is illustrated below. Retrace from t to s.

Group 2                 Group 3
<a> [b] (8,6)
<a> [d] (1,4)           <a> [d] (1,4)
<d> [e] (10,11)
<d> [g] (9,12)          <a> [b] (8,6)
<b> [c] (9,10)
<b> [e] (10,12)         <b> [c] (9,10)
<c> [f] (18,18)
<e> [f] (12,19)
<e> [h] (12,19)         <d> [e] (10,11)
<g> [h] (12,14)         <d> [g] (9,12)
                        <g> [h] (12,14)

Result:

Optimal path a-d-g-h from s = a to t = h with accumulated cost (12,14).
Dijkstra's algorithm can be applied to graphs where nodes also have a cost of traversal. To enable this, each capacitated node of degree d is replaced by a d-clique (a complete subgraph with d nodes), with each clique edge having weight equal to the original node cost. The d edges incident to the original node are reconnected, one to each of the d clique nodes. Thus, any path traversing the original node is transformed into a path that traverses two clique nodes and one clique edge.
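The node-to-clique transformation can be sketched as follows. This is an illustrative sketch under stated assumptions: the graph is undirected and stored as a dictionary of dictionaries without self-loops, each clique node ("port") is named by the hypothetical tuple (v, neighbor), and the function name `split_node` is not from the text.

```python
def split_node(adj, v, node_cost):
    """Replace node v, which has traversal cost node_cost, by a clique with
    one port per incident edge. adj = {node: {neighbor: edge_weight}}."""
    new_adj = {u: dict(nbrs) for u, nbrs in adj.items() if u != v}
    ports = []
    for u, w in adj[v].items():
        port = (v, u)                 # clique node facing neighbor u
        ports.append(port)
        del new_adj[u][v]             # reconnect edge u-v to the port
        new_adj[u][port] = w
        new_adj[port] = {u: w}
    for a in ports:                   # clique edges carry the node cost
        for b in ports:
            if a != b:
                new_adj[a][b] = node_cost
    return new_adj

# Path s - v - t, where traversing v costs 5.
adj = {"s": {"v": 1}, "v": {"s": 1, "t": 2}, "t": {"v": 2}}
g = split_node(adj, "v", 5)
print(g[("v", "s")])   # {'s': 1, ('v', 't'): 5}
```

After the transformation, the only path from s to t has cost 1 + 5 + 2 = 8, i.e., the two edge weights plus the traversal cost of v, and an unmodified shortest-path routine can be run on the result.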

► 5.6.4 Finding Shortest Paths with A* Search 

The A* search algorithm [5.9] operates similarly to Dijkstra's algorithm, but extends the cost function to include an estimated distance from the current node to the target. Like Dijkstra's algorithm, A* also guarantees to find a shortest path, if any path exists, as long as the estimated distance to the target never exceeds the actual distance (i.e., an admissibility or lower bound criterion for the distance function). As illustrated in Fig. 5.19, A* search expands only the most promising nodes; its best-first search strategy eliminates a large portion of the solution space that is processed by Dijkstra's algorithm. A distance estimate that is a tight (accurate) lower bound on actual distance can lead to significant runtime improvements over breadth-first search (BFS) and its variants, and over Dijkstra's shortest-path algorithm. Implementations of A* search may be derived from implementations of Dijkstra's algorithm by adding distance-to-target estimates: the priority of a node for expansion in A* search is based on the lowest sum of (Group 2 label in Dijkstra's algorithm) + (distance estimate, including vias, to the target node).

A common variant is bidirectional A* search, where nodes are expanded from both the source and target until the two expansion regions intersect. Using this technique, the number of nodes considered can be reduced by a small factor. However, the overhead for bookkeeping, e.g., keeping track of the order in which the nodes have been visited, and the complexity of efficient implementation are challenging. In practice, this can negate the potential advantages of bidirectional search.

Fig. 5.19 An instance of shortest-path routing with source s, target t, and obstacles (solid black squares with 'O'). (a) BFS and Dijkstra's algorithm expand the nodes outward until t is found, exploring a total of 31 nodes. (b) A* search considers six nodes in the direction of t. When A* search is finished, it has considered only about a fifth as many nodes (six out of 31, in this example) as Dijkstra's algorithm does. (c) Bidirectional A* search expands nodes from both s and t. In this example, bidirectional A* search has the same performance as unidirectional A* search.
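The relationship between A* search and Dijkstra's algorithm can be sketched on a unit-cost routing grid. This is a minimal sketch under stated assumptions: every grid step costs 1, obstacles are given as a set of blocked cells, and the distance estimate is the Manhattan distance to the target, which never exceeds the true distance on such a grid. The function name `astar`, the grid representation and the sample instance are hypothetical and do not reproduce Fig. 5.19.

```python
import heapq

def astar(width, height, obstacles, s, t):
    """A* search on a width x height grid with blocked cells.
    Returns (path, number of expanded nodes); path is None if t is unreachable."""
    def estimate(p):                       # admissible lower bound on distance to t
        return abs(p[0] - t[0]) + abs(p[1] - t[1])

    best = {s: 0}                          # cheapest known cost from s
    parent = {}
    closed = set()                         # expanded nodes (cost is final)
    frontier = [(estimate(s), 0, s)]       # priority = cost so far + estimate
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node in closed:
            continue                       # stale heap entry
        closed.add(node)
        if node == t:
            path = [t]
            while path[-1] != s:
                path.append(parent[path[-1]])
            return path[::-1], len(closed)
        x, y = node
        for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nbr[0] < width and 0 <= nbr[1] < height) or nbr in obstacles:
                continue
            trial = cost + 1
            if nbr not in closed and trial < best.get(nbr, float("inf")):
                best[nbr] = trial
                parent[nbr] = node
                heapq.heappush(frontier, (trial + estimate(nbr), trial, nbr))
    return None, len(closed)

wall = {(2, 1), (2, 2), (2, 3)}
path, expanded = astar(5, 5, wall, (0, 2), (4, 2))
print(len(path) - 1)   # 8 grid steps: the route must detour around the wall
```

Replacing `estimate` with a function that always returns 0 turns this sketch into Dijkstra's algorithm on the same grid, which makes it easy to compare the number of expanded nodes.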
In order to successfully route multiple nets, global routers must properly match nets with routing resources, without oversubscribing resources in any part of the chip. All signal nets are either routed simultaneously, e.g., using (integer) linear programming (Sec. 5.7.1), or sequentially, e.g., one net at a time (Sec. 5.6). When certain nets cause resource contention or overflow for routing edges, sequential routing requires multiple iterations. These iterations are performed by ripping up the nets that cause violations (Sec. 5.7.2) and rerouting them with fewer violations. The iterations continue until all nets are routed without violating capacities of routing-grid edges or until a timeout is exceeded.

► 5.7.1 Routing by Integer Linear Programming

A linear program (LP) consists of a set of constraints and an optional objective 
function. This function is maximized or minimized subject to these constraints. Both 
the constraints and the objective function must be linear. In particular, the 
constraints form a system of linear equations and inequalities. An integer linear 
program (ILP) is a linear program where every variable can only assume integer 
values. ILPs where all variables are binary are called 0-1 ILPs. (Integer) Linear 
programs can be solved using a variety of available software tools such as GLPK 
[5.7], CPLEX [5.13], and MOSEK [5.17]. There are several ways to formulate the 
global routing problem as an ILP, one of which is presented below. 

The ILP takes three inputs - (1) a W × H routing grid G, (2) routing edge capacities, and (3) the netlist Netlist. For explanation purposes, a horizontal edge is considered to run left to right - G(i,j) ~ G(i+1,j) - and a vertical edge is considered to run bottom to top - G(i,j) ~ G(i,j+1).

The ILP uses two sets of variables. The first set contains k Boolean variables $x_{net_1}, x_{net_2}, \dots, x_{net_k}$, each of which serves as an indicator for one of k specific paths or route options, for each net net ∈ Netlist. If $x_{net_k} = 1$ (respectively, = 0), then the route option $net_k$ is used (respectively, not used). The second set contains k real variables $w_{net_1}, w_{net_2}, \dots, w_{net_k}$, each of which represents a net weight for a specific route option for net ∈ Netlist. This net weight reflects the desirability of each route option for net (a larger $w_{net_k}$ means that the route option $net_k$ is more desirable - e.g., has fewer bends). With |Netlist| nets, and k available routes for each net net ∈ Netlist, the total number of variables in each set is k · |Netlist|.

Next, the ILP formulation relies on two types of constraints. First, each net must select a single route (mutual exclusion). Second, to prevent overflows, the number of routes assigned to each edge (total usage) cannot exceed its capacity. The ILP maximizes the total number of nets routed, but may leave some nets unrouted. That is, if a selected route causes overflow in the existing solution, then the route will not be chosen. If all routes for a particular net cause overflow, then no routes will be chosen and thus the net will not be routed.



Integer Linear Programming (ILP) Global Routing Formulation

Inputs:
  W, H                          width W and height H of routing grid G
  G(i,j)                        grid cell at location (i,j) in routing grid G
  σ(G(i,j) ~ G(i+1,j))          capacity of horizontal edge G(i,j) ~ G(i+1,j)
  σ(G(i,j) ~ G(i,j+1))          capacity of vertical edge G(i,j) ~ G(i,j+1)
  Netlist                       netlist

Variables:
  $x_{net_1}, \dots, x_{net_k}$     k Boolean path variables for each net net ∈ Netlist
  $w_{net_1}, \dots, w_{net_k}$     k net weights, one for each path of net net ∈ Netlist

Maximize:
  $\sum_{net \in Netlist} \left( w_{net_1} \cdot x_{net_1} + \dots + w_{net_k} \cdot x_{net_k} \right)$

Subject to:

  Variable Ranges:
    $x_{net_1}, \dots, x_{net_k} \in \{0,1\}$    ∀net ∈ Netlist

  Net Constraints:
    $x_{net_1} + \dots + x_{net_k} \le 1$    ∀net ∈ Netlist

  Capacity Constraints:
    $\sum_{net \in Netlist} \left( x_{net_1} + \dots + x_{net_k} \right) \le \sigma(G(i,j) \sim G(i,j+1))$
        ∀$net_k$ that use G(i,j) ~ G(i,j+1), 0 ≤ i < W, 0 ≤ j < H-1

    $\sum_{net \in Netlist} \left( x_{net_1} + \dots + x_{net_k} \right) \le \sigma(G(i,j) \sim G(i+1,j))$
        ∀$net_k$ that use G(i,j) ~ G(i+1,j), 0 ≤ i < W-1, 0 ≤ j < H

In practice, most pin-to-pin connections are routed using L-shapes or straight wires (connections without bends). In this formulation, straight connections can be routed using a straight path or a U-shape; non-straight connections can use both L-shapes. For unrouted nets, other topologies can be found using maze routing (Sec. 5.6.3).



ILP-based global routers include Sidewinder [5.12] and BoxRouter 1.0 [5.4]. Both decompose multi-pin nets into two-pin nets using FLUTE [5.5], and the route of each net is selected from two alternatives or left unselected. If neither of the two routes available for a net is chosen, Sidewinder performs maze routing to find an alternate route and replaces one of the unused routes in the ILP formulation. On the other hand, nets that were successfully routed and do not interfere with unrouted nets can be removed from the ILP formulation. Thus, Sidewinder solves multiple ILPs until no further improvement is observed. In contrast, BoxRouter 1.0 post-processes the results of its ILP using maze-routing techniques.

Example: Global Routing Using Integer Linear Programming

Given: (1) nets A-C, (2) a W = 5 × H = 4 routing grid G, (3) σ(e) = 1 for all e ∈ G, and (4) L-shapes have weight 1.00 and Z-shapes have weight 0.99. The lower-left corner is (0,0).

Task: write the ILP to route the nets in the graph to the right.

Solution:

For net A, the possible routes are two L-shapes ($A_1$,$A_2$) and two Z-shapes ($A_3$,$A_4$).

Net Constraints:
  $x_{A_1} + x_{A_2} + x_{A_3} + x_{A_4} \le 1$

Variable Constraints:
  $0 \le x_{A_1} \le 1$, $0 \le x_{A_2} \le 1$, $0 \le x_{A_3} \le 1$, $0 \le x_{A_4} \le 1$

For net B, the possible routes are two L-shapes ($B_1$,$B_2$) and one Z-shape ($B_3$).

Net Constraints:
  $x_{B_1} + x_{B_2} + x_{B_3} \le 1$

Variable Constraints:
  $0 \le x_{B_1} \le 1$, $0 \le x_{B_2} \le 1$, $0 \le x_{B_3} \le 1$

For net C, the possible routes are two L-shapes ($C_1$,$C_2$) and two Z-shapes ($C_3$,$C_4$).

Net Constraints:
  $x_{C_1} + x_{C_2} + x_{C_3} + x_{C_4} \le 1$

Variable Constraints:
  $0 \le x_{C_1} \le 1$, $0 \le x_{C_2} \le 1$, $0 \le x_{C_3} \le 1$, $0 \le x_{C_4} \le 1$



Each edge must satisfy capacity constraints. Only non-trivial constraints are shown. 



ontal Edge Capacity Constraints: 






G(0,0)~G(1,0) 


xc,+xc, 


< 


o(G(0,0)~G(l,0)) = l 


G(1,0)~G(2,0) 


xc, 


< 


o(G(l,0)~G(2,0)) = l 


G(2,0) ~ G(3,0) 


XBi + Xb^ 


< 


a(G(2,0) ~ G(3,0)) = 1 


G(3,0) ~ G(4,0) 


Xb, 


< 


a(G(3,0) ~ G(4,0)) = 1 


G(0,1)~G(1,1) 


Xa^ + Xc^ 


< 


o(G(0,l)~G(l,l)) = l 


G(1,1)~G(2,1) 


Xa2 + Xa^ + Xq^ 


< 


0(G(1,1)~G(2,1)) = 1 


G(2,1)~G(3,1) 


Xb2 


< 


a(G(2,l)~G(3,l)) = l 


G(3,1)~G(4,1) 


XB2 + XB, 


< 


o(G(3,l)~G(4,l)) = l 


G(0,2)~G(1,2) 


Xa^ + Xc2 


< 


o(G(0,2)~G(l,2)) = l 


G(1,2)~G(2,2) 


Xa^ + Xc2+Xc3 


< 


a(G(l,2)~G(2,2)) = l 


G(0,3)~G(1,3) 


XAi +^,43 


< 


a(G(0,3)~G(l,3)) = l 


G(1,3)~G(2,3) 


Xa, 


< 


0(G(1,3)~G(2,3)) = 1 
Vertical Edge Capacity Constraints:

G(0,0)~G(0,1):   $x_{C2} + x_{C4} \le \sigma(G(0,0) \sim G(0,1)) = 1$
G(1,0)~G(1,1):   $x_{C3} \le \sigma(G(1,0) \sim G(1,1)) = 1$
G(2,0)~G(2,1):   $x_{B2} + x_{C1} \le \sigma(G(2,0) \sim G(2,1)) = 1$
G(3,0)~G(3,1):   $x_{B3} \le \sigma(G(3,0) \sim G(3,1)) = 1$
G(4,0)~G(4,1):   $x_{B1} \le \sigma(G(4,0) \sim G(4,1)) = 1$
G(0,1)~G(0,2):   $x_{A2} + x_{C2} \le \sigma(G(0,1) \sim G(0,2)) = 1$
G(1,1)~G(1,2):   $x_{A3} + x_{C3} \le \sigma(G(1,1) \sim G(1,2)) = 1$
G(2,1)~G(2,2):   $x_{A1} + x_{A4} + x_{C1} + x_{C4} \le \sigma(G(2,1) \sim G(2,2)) = 1$
G(0,2)~G(0,3):   $x_{A2} + x_{A4} \le \sigma(G(0,2) \sim G(0,3)) = 1$
G(1,2)~G(1,3):   $x_{A3} \le \sigma(G(1,2) \sim G(1,3)) = 1$
G(2,2)~G(2,3):   $x_{A1} \le \sigma(G(2,2) \sim G(2,3)) = 1$


Objective Function:

Maximize $x_{A1} + x_{A2} + 0.99 \cdot x_{A3} + 0.99 \cdot x_{A4} + x_{B1} + x_{B2} + 0.99 \cdot x_{B3} + x_{C1} + x_{C2} + 0.99 \cdot x_{C3} + 0.99 \cdot x_{C4}$
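The instance above is small enough to check by exhaustive enumeration. The following Python sketch is not an ILP solver; it simply tries every "one route or no route" choice per net, which is equivalent to the 0-1 formulation for an example of this size. The pin locations (A at (2,1) and (0,3), B at (2,0) and (4,1), C at (0,0) and (2,2)) and the resulting edge lists are assumptions inferred from the capacity constraints, and the helper names `h`, `v` and `solve` are hypothetical.

```python
from itertools import product

def h(x, y):
    """Horizontal grid edge G(x,y)~G(x+1,y)."""
    return ("H", x, y)

def v(x, y):
    """Vertical grid edge G(x,y)~G(x,y+1)."""
    return ("V", x, y)

# Candidate routes per net: name -> (grid edges used, objective weight).
# L-shapes have weight 1.00, Z-shapes 0.99, as in the objective function.
# Edge lists are inferred from the capacity constraints (an assumption).
ROUTES = {
    "A": {"A1": ([v(2, 1), v(2, 2), h(1, 3), h(0, 3)], 1.00),
          "A2": ([h(1, 1), h(0, 1), v(0, 1), v(0, 2)], 1.00),
          "A3": ([h(1, 1), v(1, 1), v(1, 2), h(0, 3)], 0.99),
          "A4": ([v(2, 1), h(1, 2), h(0, 2), v(0, 2)], 0.99)},
    "B": {"B1": ([h(2, 0), h(3, 0), v(4, 0)], 1.00),
          "B2": ([v(2, 0), h(2, 1), h(3, 1)], 1.00),
          "B3": ([h(2, 0), v(3, 0), h(3, 1)], 0.99)},
    "C": {"C1": ([h(0, 0), h(1, 0), v(2, 0), v(2, 1)], 1.00),
          "C2": ([v(0, 0), v(0, 1), h(0, 2), h(1, 2)], 1.00),
          "C3": ([h(0, 0), v(1, 0), v(1, 1), h(1, 2)], 0.99),
          "C4": ([v(0, 0), h(0, 1), h(1, 1), v(2, 1)], 0.99)},
}
CAPACITY = 1  # every edge capacity in the example equals 1

def solve(routes, capacity):
    """Enumerate all route choices and keep the best feasible one."""
    nets = list(routes)
    options = [[None] + list(routes[net]) for net in nets]
    best_value, best_choice = -1.0, None
    for choice in product(*options):
        usage, value = {}, 0.0
        for net, name in zip(nets, choice):
            if name is None:          # net left unrouted
                continue
            edges, weight = routes[net][name]
            value += weight
            for e in edges:
                usage[e] = usage.get(e, 0) + 1
        feasible = all(u <= capacity for u in usage.values())
        if feasible and value > best_value:
            best_value, best_choice = value, dict(zip(nets, choice))
    return best_value, best_choice

best_value, best_choice = solve(ROUTES, CAPACITY)
print(best_value, best_choice)
```

Enumeration grows exponentially with the number of nets, so this only serves to sanity-check a formulation; realistic instances need an actual ILP solver.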



► 5.7.2 Rip-Up and Reroute (RRR) 

Modern ILP solvers help advanced ILP-based global routers to successfully
complete hundreds of thousands of routes within hours [5.4][5.12]. However,
commercial EDA tools require greater scalability and lower runtimes. These
performance requirements are typically satisfied using the rip-up and reroute (RRR)
framework, which focuses on problematic nets. If a net cannot be routed, this is
often due to physical obstacles or other routed nets being in the way. The key idea is
to allow temporary violations, so that all nets are routed, but then iteratively remove
some nets (rip-up), and route them differently (reroute) so as to decrease the number
of violations. In contrast, push-and-shove strategies [5.16] move currently routed
nets to new locations (without rip-up) to relieve wire congestion or to allow
previously unroutable nets to become routable.



An intuitive, greedy approach to routing would route nets sequentially and insist on
violation-free routes where such routes are possible, even at the cost of large detours.
On the other hand, the RRR framework allows nets to (temporarily) route through
over-capacity regions.* This helps decide which nets should detour, rather than
detouring the net routed most recently. In the example of Fig. 5.20(a), assume that
the nets are routed in an order based on the size of the net's aspect ratio and MBB
(A-B-C-D). If each net is routed without violations (Fig. 5.20(b)), then net D is
forced to detour heavily. However, if nets are allowed to route with violations, then
some nets are ripped up and rerouted, enabling D to use fewer routing segments (Fig.
5.20(c)).



* Allowing temporary violations is a common tactic for routing large-scale modern (ASIC) designs,
while routing nets without violations is common for PCBs.

Fig. 5.20 Routing without violations versus rip-up and reroute. Let the net ordering be A-B-C-D. (a)
The routing instance with nets A-D. (b) If all nets are routed without allowing violations, net D is
forced to detour heavily; the resulting total wirelength is 21. (c) If nets are routed with violations
allowed, all nets with violations can be ripped up and rerouted, resulting in lower wirelength. In
this example, net A is rerouted with a shortest-path configuration, nets B and C are slightly detoured,
and net D remains the same with its shortest-path configuration. The resulting wirelength is 19.

Traditional RRR strategies depend on (1) ripping up and rerouting all nets in
violation, and (2) selecting an effective net ordering. That is, the order in which nets
are routed greatly affects the quality of the final solution. For example, Kuh and
Ohtsuki in 1990 defined quantifiable probabilities of RRR success (success rates)
for each net in violation, and only ripped up and rerouted the most promising nets
[5.14]. However, these success rates are computed whenever a net cannot be
routed without violation, thereby incurring runtime penalties, especially for
large-scale designs.

Global Routing Framework (With a Focus on Rip-Up and Reroute)
Input: unrouted nets Netlist, routing grid G
Output: routed nets Netlist

1.  v_nets = Ø                            // ordered list of violating nets
2.  foreach (net net ∈ Netlist)           // initial routing
3.    ROUTE(net,G)                        // route allowing violations
4.    if (HAS_VIOLATION(net,G))           // if net has violation,
5.      ADD_TO_BACK(v_nets,net)           // add net to ordered list
                                          // start RRR framework
6.  while (v_nets != Ø || !OUT())         // if nets still have violations or
                                          // stopping condition is not met
7.    v_nets = REORDER(v_nets)            // optionally change net ordering
8.    for (i = 1 to i = |v_nets|)
9.      net = FIRST_ELEMENT(v_nets)       // process first element
10.     if (HAS_VIOLATION(net,G))         // net still has violation
11.       RIP_UP(net,G)                   // rip up net
12.       ROUTE(net,G)                    // reroute
13.       if (HAS_VIOLATION(net,G))       // if still has violation, add to
14.         ADD_TO_BACK(v_nets,net)       // list of violating nets
15.     REMOVE_FIRST_ELEMENT(v_nets)      // remove first element



To improve computational scalability, a modern global router keeps track of all nets
that are routed with violations, i.e., nets that go through at least one edge that is
over capacity. All these nets are added to an ordered list v_nets (lines 1-5).
Optionally, v_nets can be sorted to suit a different ordering (line 7). For each net net
in v_nets (line 8), the router first checks whether net still has violations (line 10). If
net has no violations, i.e., some other nets have been rerouted away from congested
edges used by net, then net is skipped. Otherwise, the router rips up and reroutes net
(lines 11-12). If net still has violations, then the router adds net back to v_nets
(lines 13-14). This process continues until all nets have been processed or a stopping
condition is reached (lines 6-15). Variants of this framework include (1) ripping up all
violating nets at once, and then rerouting nets one by one, and (2) checking for
violations after rerouting all nets.
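The framework can be sketched in Python on a toy instance. This is a minimal illustration under stated assumptions, not a production router: each net chooses among a fixed list of candidate routes (tuples of edge names) instead of running a maze router, the stopping condition is a cap on the number of passes (`max_passes`), and all names (`rip_up_and_reroute`, `best_route`, the edges `e1` to `e5`) are hypothetical.

```python
from collections import Counter, deque

def rip_up_and_reroute(candidates, capacity, max_passes=10):
    usage = Counter()   # number of nets currently using each edge
    routes = {}         # net -> chosen route (tuple of edges)

    def place(net, route):
        routes[net] = route
        usage.update(route)

    def rip_up(net):
        usage.subtract(routes.pop(net))

    def has_violation(net):
        return any(usage[e] > capacity[e] for e in routes[net])

    def best_route(net):
        # Prefer routes that push the fewest edges over capacity,
        # then routes with the fewest edges.
        def key(route):
            overflow = sum(usage[e] + 1 > capacity[e] for e in route)
            return (overflow, len(route))
        return min(candidates[net], key=key)

    # Lines 1-5: initial routing, violations allowed.
    v_nets = deque()
    for net in candidates:
        place(net, min(candidates[net], key=len))
        if has_violation(net):
            v_nets.append(net)

    # Lines 6-15: rip-up and reroute passes.
    passes = 0
    while v_nets and passes < max_passes:
        passes += 1
        for _ in range(len(v_nets)):
            net = v_nets.popleft()            # lines 9 and 15
            if has_violation(net):            # line 10
                rip_up(net)                   # line 11
                place(net, best_route(net))   # line 12
                if has_violation(net):        # lines 13-14
                    v_nets.append(net)
    return routes, list(v_nets)

# Toy instance: both nets prefer edge e1, which has capacity 1.
candidates = {"n1": [("e1",), ("e2", "e3")],
              "n2": [("e1",), ("e4", "e5")]}
capacity = {e: 1 for e in ("e1", "e2", "e3", "e4", "e5")}
routes, remaining = rip_up_and_reroute(candidates, capacity)
print(routes, remaining)
```

Here only n2 is flagged after initial routing, because n1 had no violation at the moment it was checked; n2 is then ripped up and detoured, and n1 keeps its short route.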

Notice that in this RRR framework, not all nets are necessarily ripped up. To further 
reduce runtime, some violating nets can be selectively chosen (temporarily) not to 
be ripped up. This typically causes wirelength to increase by a small amount, but 
reduces runtime by a large amount [5.11]. In the context of negotiated congestion 
routing (Sec. 5.8.2), nets are ripped-up and rerouted to also build up appropriate 
history costs on congested edges. Maintaining these history costs improves the 
success of rip-up and reroute and decreases the significance of ordering. 

5.8 Modern Global Routing

As chip complexity grows, routers must limit both routed interconnect length and
the number of vias, as this greatly affects the chip's performance, dynamic power
consumption, and yield. Violation-free global routing solutions facilitate smooth
transitions to design for manufacturability (DFM) optimizations. Completing global
routing without violations allows the physical design process to move on to detailed
routing and ensuing steps of the flow. However, if a placed design is inevitably
unroutable or if a routed design exhibits violations, then a secondary step must
isolate problematic regions. In cases where numerous violations are found, repair
is commonly performed by repeating global or detailed placement and injecting
whitespace into congested regions.

Several notable global routers have been developed for the ISPD 2007 and 2008 
Global Routing Contests [5.18]. In 2007, FGR [5.21], MaizeRouter [5.16], and 
BoxRouter [5.4] claimed the top three places. In 2008, NTHU-Route 2.0 [5.2] and 
NTUgr [5.3], which focused on better solution quality, and FastRoute 3.0 [5.23], 
which focused on runtime, took the top three places.* Fig. 5.21 shows the general 
flow for several global routers, where each router uses a unique set of 
optimizations targeting a particular tradeoff between runtime and solution quality. 



* FastRoute 4.0 [5.22] was released shortly after the contest, with both solution quality and runtime
improvements compared to FastRoute 3.0.
Given a global routing instance - a netlist and a routing grid with capacities - a
global router first splits nets with three or more pins into two-pin subnets. It then
produces an initial routing solution on a two-dimensional grid. If the design has no
violations, the global router performs layer assignment - mapping 2D routes onto
a 3D grid. Otherwise, nets that cause violations are ripped up and rerouted. This
iterative process continues until the design is violation-free or a stopping condition
(e.g., CPU limit) is reached. After rip-up and reroute, some routers perform an
optional clean-up pass to further minimize wirelength. Other global routers directly
route on the 3D grid; this method tends to improve wirelength, but is slow and may
fail to complete routing. More information on the global routing flow, optimizations
and implementation can be found in [5.11].



[Flowchart: Global Routing Instance, Net Decomposition, Initial Routing, Rip-up and Reroute,
Layer Assignment, Final Improvements (optional)]

Fig. 5.21 Standard global routing flow.



Individual nets are often routed by constructing point-to-point connections using 
maze routing (Sec. 5.6.3) and pattern routing (Sec. 5.8.1) [5.22]. A popular method 
to control each net's cost is negotiated congestion routing (NCR) (Sec. 5.8.2) [5.15].

► 5.8.1 Pattern Routing 

Given a set of two-pin (sub)nets, a global router must find paths for each net while
respecting capacity constraints. Most nets are routed with short paths to minimize
wirelength. Maze-routing techniques such as Dijkstra's algorithm and A* search
can be used to guarantee a shortest path between two points. However, these
techniques can be unnecessarily slow, especially when the generated topologies
are composed of edges (point-to-point connections) that are routed using very few
vias, such as an L-shape. In practice, many nets' routes are not only short but also
have few bends. Therefore, few nets require a maze router.



To improve runtime, pattern routing searches through a small number of route
patterns. It often finds paths that cannot be improved, rendering maze routing
unnecessary. Given an m × n bounding box where n = k · m and k is a constant,
pattern routing takes O(n) time, while maze routing requires O(n^2 log n) time.
Topologies commonly used in pattern routing include L-shapes, Z-shapes, and
U-shapes (Fig. 5.22).
Fig. 5.22 Common patterns used to route two-pin nets, such as L-shapes, Z-shapes, and U-shapes.
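As a rough illustration of pattern routing, the Python sketch below enumerates the L-shapes and Z-shapes inside the bounding box of a two-pin net and picks the cheapest one under a caller-supplied edge cost. U-shapes, which leave the bounding box, are omitted for brevity. The function names (`pattern_routes`, `unit_edges`, `best_pattern`) and the cost function in the usage example are hypothetical.

```python
def pattern_routes(src, dst):
    """Return candidate routes as lists of corner points (pins included)."""
    (x1, y1), (x2, y2) = src, dst
    if x1 == x2 or y1 == y2:
        return [[src, dst]]                      # straight connection
    routes = [[src, (x2, y1), dst],              # L-shape, horizontal first
              [src, (x1, y2), dst]]              # L-shape, vertical first
    lo, hi = sorted((x1, x2))
    for xm in range(lo + 1, hi):                 # Z-shapes, vertical middle
        routes.append([src, (xm, y1), (xm, y2), dst])
    lo, hi = sorted((y1, y2))
    for ym in range(lo + 1, hi):                 # Z-shapes, horizontal middle
        routes.append([src, (x1, ym), (x2, ym), dst])
    return routes

def unit_edges(route):
    """Expand a list of corner points into unit-length grid edges."""
    edges = []
    for (xa, ya), (xb, yb) in zip(route, route[1:]):
        dx = (xb > xa) - (xb < xa)
        dy = (yb > ya) - (yb < ya)
        x, y = xa, ya
        while (x, y) != (xb, yb):
            nxt = (x + dx, y + dy)
            edges.append(tuple(sorted([(x, y), nxt])))
            x, y = nxt
    return edges

def best_pattern(src, dst, cost):
    """Pick the pattern with the lowest total edge cost."""
    return min(pattern_routes(src, dst),
               key=lambda r: sum(cost(e) for e in unit_edges(r)))

# Usage: penalize one congested edge so that patterns avoid it.
congested = ((0, 0), (1, 0))
cost = lambda e: 10.0 if e == congested else 1.0
print(len(pattern_routes((0, 0), (2, 2))))       # 2 L-shapes + 2 Z-shapes
print(best_pattern((0, 0), (2, 2), cost))
```

For a bounding box spanning dx columns and dy rows of grid edges, this enumerates 2 + (dx - 1) + (dy - 1) patterns, which is linear in the box perimeter, in line with the O(n) bound quoted above.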

► 5.8.2 Negotiated Congestion Routing 

Modern routers perform rip-up and reroute using negotiated congestion routing
(NCR), where each edge e is assigned a cost value cost(e) that reflects the demand
for edge e. A segment from net net that is routed through e pays a cost of cost(e).
The total cost of net is the sum of cost(e) values taken over all edges used by net.

$cost(net) = \sum_{e \in net} cost(e)$



A higher cost{e) value discourages nets from using e and implicitly encourages nets 
to seek out other, less used edges. Iterative routing approaches use methods such as 
Dijkstra's algorithm or A* search to find routes with minimum cost while respecting 
edge capacities. That is, during the current iteration, all nets are routed based on the 
current edge costs. If any nets cause violations, i.e., some edges e are congested, 
then (1) the nets are ripped up, (2) the costs of edges that these nets cross are 
updated to reflect their congestion, and (3) the nets are rerouted in the next iteration. 
This process continues until all nets are routed or some stopping condition is 
reached. 



The edge cost cost(e) is increased according to the edge congestion $\varphi(e)$, defined as
the total number of nets $\eta(e)$ passing through e divided by the capacity $\sigma(e)$ of e.

$\varphi(e) = \frac{\eta(e)}{\sigma(e)}$
If e is uncongested, i.e., $\varphi(e) \le 1$, then cost(e) does not change. If e is congested, i.e.,
$\varphi(e) > 1$, then cost(e) is increased so as to penalize nets that use e in subsequent
iterations. In NCR, cost(e) only increases or remains the same.* In practice, the edge
costs are updated after initial routing and after every subsequent routing iteration. As
such, the number of different routes having identical costs is reduced for each net,
making net ordering less important for global routing.

The rate $\Delta cost(e)$ at which cost(e) grows must be controlled. If $\Delta cost(e)$ is too high,
then entire groups of nets will be simultaneously pushed away from one edge and
toward another. This can cause the nets' routes to bounce back and forth between
edges, leading to longer runtime and longer routes, and possibly jeopardizing
successful routing. On the other hand, if $\Delta cost(e)$ is too low, then many more
iterations will be required to route all nets without violation, causing an increase in
runtime. Ideally, the rate of cost increase should be gradual so that a fraction of nets
will be routed differently in each iteration. In different routers, this rate has been
modeled by linear functions [5.3], dynamically changing logistic functions [5.16],
and exponential functions with a slowly-growing constant [5.1][5.21]. In practice,
well-tuned NCR-based routers can effectively reduce congestion while maintaining
low wirelength, defined as the routed length plus the number of vias (Fig. 5.23).
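The cost-update loop can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not taken from any particular router: every edge starts at cost 1, each net picks the cheapest of a fixed list of candidate routes, and the cost of a congested edge grows by a constant `delta` per iteration (a linear model). All names are hypothetical.

```python
from collections import Counter

def negotiated_congestion(candidates, capacity, delta=0.6, max_iters=20):
    cost = {e: 1.0 for e in capacity}     # cost(e), never decreases
    routes = {}
    for iteration in range(max_iters):
        # Route every net at minimum total cost under the current edge costs.
        routes = {net: min(cands, key=lambda r: sum(cost[e] for e in r))
                  for net, cands in candidates.items()}
        # Congestion phi(e) = (nets using e) / (capacity of e).
        usage = Counter(e for r in routes.values() for e in r)
        congested = [e for e in usage if usage[e] / capacity[e] > 1]
        if not congested:
            return routes, cost, iteration + 1
        for e in congested:
            cost[e] += delta              # penalize congested edges only
    return routes, cost, max_iters

# Toy instance: both nets want e1; n1 has the cheaper detour.
candidates = {"n1": [("e1",), ("e2", "e3")],
              "n2": [("e1",), ("e4", "e5", "e6")]}
capacity = {e: 1 for e in ("e1", "e2", "e3", "e4", "e5", "e6")}
routes, cost, iterations = negotiated_congestion(candidates, capacity)
print(routes, cost["e1"], iterations)
```

With a gradual increase, only the net with the cheaper detour (n1) eventually leaves e1, while n2 keeps its short route. A very large `delta` would push both nets away at once, which is the oscillation risk described above.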

[Plot: Evolution of Wirelength and Violations During Rip-Up and Reroute]

Fig. 5.23 Progression of wirelength and violation count during the rip-up and reroute stage.

* If cost(e) were to decrease, then nets previously penalized for using e no longer have that cost, and
the effort to route these nets in previous iterations will be wasted since those nets will then use the
same edges as before.
Chapter 5 Exercises

Exercise 1: Steiner Tree Routing

Given the six-pin net on the routing grid (right).

(a) Mark all Hanan points and draw the MBB. 

(b) Generate the RSMT using the heuristic in 
Sec. 5.6.1. Show all intermediate steps. 

(c) Determine the degree of each Steiner point 
in the RSMT for the given net. 

(d) Determine the maximum number of Steiner 
points that a three-pin RSMT can have. 
Exercise 2: Global Routing in a Connectivity Graph

Two nets A and B and the connectivity graph with capacities are given. Determine 
whether this placement is routable. If the placement is not routable, explain why. In 
either case, routable or unroutable, calculate the remaining capacities after both nets 
have been routed. 





Exercise 3: Dijkstra's Algorithm 

For the graph with weights (w1, w2) shown below, use Dijkstra's algorithm to find a
minimum-cost path from the starting node s = a to the target node t = i. Generate the
tables for Groups 2 and 3 as in the example of Sec. 5.6.3.



(3,2) 



(5,3) 
Exercise 4: ILP-Based Global Routing

Modify the example given in Sec. 5.7.1 by disallowing Z-shape routes. Give the full 
ILP instance and state whether it is feasible, i.e., has a valid solution. If a solution 
exists, then illustrate the routes on the grid. Otherwise, explain why no solution 
exists. 

Exercise 5: Shortest Path with A* Search 

Modify the example illustrated in Fig. 5.19 by removing one obstacle. Number the
nodes searched as in Fig. 5.19(b).

Exercise 6: Rip-Up and Reroute 

Consider rip-up and reroute on an m x m grid with n nets. Estimate the required 
memory usage. Choose from the following. 



Oini) 
0(m^- n) 



0{nf+n) 



0{m 
0(in ■ n) 



n') 



0{m-n^) 
6 Detailed Routing



Recall from Chap. 5 that the layout region is represented by a coarse grid consisting 
of global routing cells (gcells) or more general routing regions (channels, 
switchboxes) during global routing. Afterward, each net undergoes detailed routing. 

The objective of detailed routing is to assign route segments of signal nets to specific 
routing tracks, vias, and metal layers in a manner consistent with given global 
routes of those nets. These route assignments must respect all design rules. 

Each gcell is orders of magnitude smaller than the entire chip, e.g., 10 x 10 routing 
tracks, regardless of the actual chip size. As long as the routes remain properly 
connected across all neighboring gcells, the detailed routing of one gcell can be 
performed independently of the routing of other gcells. This facilitates an efficient 
divide-and-conquer framework and also enables parallel algorithms. Thus, detailed 
routing runtime can (theoretically) scale linearly with the size of the layout. 
Traditional detailed routing techniques are applied within routing regions, such as 
channels (Sec. 6.3) and switchboxes (Sec. 6.4). For modern designs, over-the-cell 
(OTC) routing (Sec. 6.5) allows wires to be routed over standard cells. Due to 
technology scaling, modern detailed routers must account for manufacturing rules
and the impact of manufacturing faults (Sec. 6.6). 



6.1 Terminology



Channel routing is a special case of detailed routing where the connections between 
terminal pins are routed within a routing region (channel) that has no obstacles. The 
pins are located on opposite sides of the channel (Fig. 6.1, left). By convention, the 
channel is oriented horizontally - pins are on the top and bottom of the channel. In 
row-based layouts, in a given block, the routing channels typically have uniform 
channel width. In gate-array and standard-cell circuits that use more than three 
layers of metal, channel height, the number of routing tracks between the top and 
bottom boundaries of the channel, is also uniform. 

Switchbox routing is performed when pin locations are given on all four sides of a
fixed-size routing region (switchbox, Fig. 6.1, right). This makes the detailed routing
significantly more difficult than in channel routing. Switchbox routing is further
discussed in Sec. 6.4.

OTC (over-the-cell) routing uses additional metal tracks, e.g., on Metal3 and Metal4, 
that are not obstructed by cells, allowing routes to cross cells and channels. An 
example is shown in Fig. 6.2. OTC routing can use only the metal layers and tracks 
that the cells do not occupy. When the cells utilize only the polysilicon and Metal1
layers, routing can be performed on the remaining metal layers (Metal2, Metal3, etc.)
as well as unused Metal1 resources. OTC routing is further discussed in Sec. 6.5.



[Figure panels: Channel Routing (left), Switchbox Routing (right)]

Fig. 6.1 Example of two-layer channel and switchbox routing. The pins of each net A-D are all 
positioned perpendicular to the channel. Each layer has a preferred direction - one layer has only 
horizontal tracks and one layer has only vertical tracks. Vias are used if routing a net requires both 
horizontal and vertical tracks. 

In classical channel routing, the routing area is a rectangular grid (Fig. 6.3) with pin 
locations on top and bottom boundaries. The pins are located on the vertical grid 
lines or columns. The channel height depends on the number of tracks that are 
needed to route all the nets. In two-layer routing, one layer is reserved exclusively 
for horizontal tracks while the other is reserved for vertical tracks. The preferred 
direction of each routing layer is determined by the floorplan and the orientation of 
the standard-cell rows. In a horizontal cell row, polysilicon (transistor gate)
segments are typically vertical (V) and Metal1 segments are horizontal (H). The
metal layers' preferred directions then alternate between H and V. To connect to a
cell pin that is on the Metal1 layer, the router will drop one or more vias from a
Metal2 routing segment.
Fig. 6.2 An example of OTC routing. Note that fewer routing tracks are required if some nets are
routed over the cell area.
[Figure: horizontal channel with columns a-g; labeled elements: Columns, Vertical Segment (Branch),
Horizontal Segment (Trunk), Pin Locations]

Fig. 6.3 Terminology related to channel routing for a horizontal channel. The columns are the 
vertical grid lines while the tracks are the horizontal grid lines. 

Given a channel, its upper and lower boundaries are each defined by a vector of net
IDs, denoted as TOP and BOT, respectively. Here, each column is represented by
two net IDs (Fig. 6.3), one from the top channel boundary and one from the bottom
channel boundary. Unconnected pins are given a net ID of 0. In the example of Fig.
6.3, TOP = [B 0 B C D B C] and BOT = [A C A B 0 B C].

A horizontal constraint exists between two nets if their horizontal segments overlap 
when placed on the same track. The example in Fig. 6.4 includes one horizontal and 
one vertical routing layer, with nets B and C being horizontally constrained. If the 
two nets' horizontal segments do not overlap, then they can both be assigned to the 
same track and are horizontally unconstrained, e.g., nets A and B in Fig. 6.4. 




[Figure: nets A, B and C in a channel; B and C are labeled Horizontally Constrained, A and B are
labeled Horizontally Unconstrained]

Fig. 6.4 Example of horizontally constrained and unconstrained nets. Nets B and C are horizontally 
constrained and thus require different horizontal tracks. 



A vertical constraint exists between two nets if they have pins in the same column. 
In other words, the vertical segment coming from the top must "stop" within a short 
distance so that it does not overlap with the vertical segment coming from the 
bottom in the same column (Fig. 6.5). If each net is assigned to a single horizontal 
track, then the horizontal segment of a net from the top must be placed above the 
horizontal segment of a net from the bottom in the same column. In Fig. 6.5(a), this
vertical constraint assigns the horizontal segment of net A to a track above the
horizontal segment of net B. To satisfy these constraints, at least three columns are
required to "uncross" the two nets.
Fig. 6.5 Examples of vertically constrained nets. (a) Nets A and B do not have a vertical conflict. (b)
Nets A and B have a vertical conflict. The two nets can be routed by splitting the vertical segment of
one net and using an additional third track.

Although a vertical constraint implies a horizontal constraint, the converse is not 
necessarily true. However, both types of constraints must be satisfied when 
assigning segments in a channel. 






6.2 Horizontal and Vertical Constraint Graphs 



The relative positions of nets in a channel routing instance, encoded with horizontal 
and vertical constraints, can be modeled by horizontal and vertical constraint graphs, 
respectively. These graphs are used to (1) initially predict the minimum number of 
tracks that are required and (2) detect potential routing conflicts. 

► 6.2.1 Horizontal Constraint Graphs



Zone representation. In a channel, all horizontal wire segments must span at least
the leftmost and rightmost pins of their respective nets. Let S(col) denote the set of
nets that pass through column col. In other words, S(col) contains all nets that either
(1) are connected to a pin in column col or (2) have pin connections to both the left
and right of col. Since horizontal segments cannot overlap, each net in S(col) must
be assigned to a different track in column col. Only a subset of all columns is needed
to describe the entire channel. If there exist columns i and j such that S(i) is a subset
of S(j), then S(i) can be ignored since it imposes fewer constraints on the routing
solution than S(j). In Fig. 6.6, every S(col) is a subset of at least one of S(c), S(f), S(g)
or S(i). Furthermore, the maximal columns c, f, g and i comprise a minimal set of
columns with this property. Note that these columns together contain all the nets.
Fig. 6.6 A channel routing problem (top) and its corresponding zone representation (right). Only the
maximal columns are shown.



Graphical representation. The nets within the channel can also be represented by a
horizontal constraint graph HCG(V,E), where nodes v ∈ V correspond to the nets of
the netlist and an undirected edge e(i,j) ∈ E exists between nodes i and j if the
corresponding nets are both elements of some set S(col). In other words, e ∈ E if the
corresponding nets are horizontally constrained. Fig. 6.7 illustrates the HCG for the
channel routing instance of Fig. 6.6. A lower bound on the number of tracks
required by the channel routing can be found from either the HCG or the zone
representation. This lower bound is given by the maximum cardinality of any S(col).




Fig. 6.7 The HCG for the channel routing 
instance of Fig. 6.6. With either the HCG 
or the zone representation, a lower bound 
on the number of tracks is five. 



► 6.2.2 Vertical Constraint Graphs 



Vertical constraints are represented by a vertical constraint graph VCG(V,E). A
node v ∈ V represents a net. A directed edge e(i,j) ∈ E connects nodes i and j if net i
must be located above net j. However, an edge that can be derived by transitivity is
not included. For instance, in Fig. 6.8, edge (B,C) is not included because it can be
derived from edges (B,E) and (E,C).







Fig. 6.8 The VCG for the channel routing 
instance of Fig. 6.6. 



A cycle in the VCG indicates a conflict where vertical segments of two nets overlap 
at a specific column. That is, the horizontal segments of the two nets would have to 
be simultaneously above and below each other. This contradiction can be resolved 
by splitting the net and using an additional track (Fig. 6.9). 
Fig. 6.9 A channel routing problem (left),
its VCG with a cycle or conflict (center),
and a possible solution using net splitting
and an additional third track (right).



If a cycle occurs in a maximum-cardinality set S(col) of the HCG, the lower bound
on the number of required tracks, based on the number of nets in S(col), is no longer 
tight. Nets must now be split and the minimum number of required tracks must be 
adjusted to account for both HCG and VCG conflicts. 



Example: Vertical and Horizontal Constraint Graphs
Given: channel routing instance (right).

Task: find the horizontal constraint graph
(HCG) and the vertical constraint graph
(VCG) of nets A-F.

Solution:

TOP and BOT vectors: TOP = [0 B D B A C E], BOT = [D C E F 0 A F]
Determine S(col) for col = a ... g.

S(a) = {D}   S(b) = {B,C,D}   S(c) = {B,C,D,E}   S(d) = {B,C,E,F}

S(e) = {A,C,E,F}   S(f) = {A,C,E,F}   S(g) = {E,F}

Find the maximal S(col).

S(a) = {D} and S(b) = {B,C,D} are both subsets of S(c) = {B,C,D,E}.
S(f) = {A,C,E,F} and S(g) = {E,F} are both subsets of S(e) = {A,C,E,F}.
Maximal S(col): S(c) = {B,C,D,E}   S(d) = {B,C,E,F}   S(e) = {A,C,E,F}
Find the HCG and VCG.

Since there are no cycles in the VCG, no net splitting is required (Sec. 6.2.2). Thus, each net
only needs one horizontal segment for routing.

The track assignment is based on both the VCG and the HCG. For instance, based on the 
VCG, net D is assigned to the topmost track. The other net that is on the top level in the VCG 
(net B) is assigned to another track due to the HCG. 

The HCG determines a lower bound on the number of required tracks. Since there are no
cycles (conflicts) in the VCG, the minimum number of tracks is equal to the cardinality of the
largest S(col), here being |S(c)| = |S(d)| = |S(e)| = 4.
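The sets S(col), the maximal columns, and the edges of both constraint graphs can be derived mechanically from the TOP and BOT vectors. The Python sketch below does this for the instance of this example; the function names are hypothetical, unconnected pins are encoded as the integer 0, and the VCG is built from the direct column constraints only, without removing edges that follow by transitivity.

```python
from itertools import combinations

def zone_sets(top, bot):
    """S(col) for every column: nets whose pin span covers that column."""
    span = {}
    for col, pins in enumerate(zip(top, bot)):
        for net in pins:
            if net != 0:
                lo, hi = span.get(net, (col, col))
                span[net] = (min(lo, col), max(hi, col))
    return [{n for n, (lo, hi) in span.items() if lo <= col <= hi}
            for col in range(len(top))]

def maximal_sets(sets):
    """Drop sets that are subsets of another set (keep first of equal sets)."""
    return [s for i, s in enumerate(sets)
            if not any(s < t or (s == t and j < i)
                       for j, t in enumerate(sets) if j != i)]

def hcg_edges(sets):
    """Undirected edges between nets that share some S(col)."""
    return {frozenset(pair) for s in sets
            for pair in combinations(sorted(s), 2)}

def vcg_edges(top, bot):
    """Directed edges (upper net, lower net) from columns with two pins."""
    return {(t, b) for t, b in zip(top, bot) if t != 0 and b != 0 and t != b}

TOP = [0, "B", "D", "B", "A", "C", "E"]
BOT = ["D", "C", "E", "F", 0, "A", "F"]

S = zone_sets(TOP, BOT)
maximal = maximal_sets(S)
lower_bound = max(len(s) for s in S)
print(maximal, lower_bound, sorted(vcg_edges(TOP, BOT)))
```

The value `lower_bound` is the HCG-based lower bound on the number of tracks; whether it can be reached also depends on the vertical constraints.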



6.3 Channel Routing Algorithms 






Channel routing seeks to minimize the number of tracks required to complete 
routing. In gate-array designs, channel height is typically fixed, and algorithms are 
designed to pursue 100% routing completion. 

► 6.3.1 Left-Edge Algorithm 

An early channel routing algorithm was developed by Hashimoto and Stevens [6.8]. 
Their simple and widely used left-edge heuristic, based on the VCG and the zone 
representation, greedily maximizes the usage of each track. The former identifies the 
assignment order of nets to tracks, and the latter determines which nets may share 
the same track. Each net uses only one horizontal segment (trunk). 



The left-edge algorithm works as follows. Start with the topmost track (line 1). For
all unassigned nets nets_unassigned (line 3), generate the VCG and the zone
representation (lines 4-5). Then, in left-to-right order (line 6), for each unassigned
net n, assign it to the current track if (1) n has no predecessors in the VCG and (2) it
does not cause a conflict with any nets that have been previously assigned (lines
7-11). Once n has been assigned, remove it from nets_unassigned (line 12). After all
unassigned nets have been considered, increment the track index (line 13). Continue
this process until all nets have been assigned to routing tracks (lines 3-13).
Left-Edge Algorithm

Input: channel routing instance CR

Output: track assignments for each net

1.  curr_track = 1                               // start with topmost track
2.  nets_unassigned = Netlist
3.  while (nets_unassigned != Ø)                 // while nets still unassigned
4.    VCG = VCG(CR)                              // generate VCG and zone
5.    ZR = ZONE_REP(CR)                          // representation
6.    SORT(nets_unassigned, start column)        // find left-to-right ordering
                                                 // of all unassigned nets
7.    for (i = 1 to |nets_unassigned|)
8.      curr_net = nets_unassigned[i]
9.      if (PARENTS(curr_net) == Ø &&            // if curr_net has no parent
10.         TRY_ASSIGN(curr_net,curr_track))     // and does not cause
                                                 // conflicts on curr_track,
11.       ASSIGN(curr_net,curr_track)            // assign curr_net
12.       REMOVE(nets_unassigned,curr_net)
13.   curr_track = curr_track + 1                // consider next track



The left-edge algorithm finds a solution with the minimum number of tracks, or the
maximum cardinality of S(col), if there are no cycles in the VCG. Yoshimura [6.18]
enhanced track selection by incorporating net length in the VCG. Yoshimura and
Kuh [6.19] improved track utilization by net splitting before constructing the VCG.
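A compact Python sketch of the left-edge procedure is given below. It is an illustration under stated assumptions rather than a reference implementation: the channel is given as TOP and BOT vectors with 0 for unconnected pins, each net occupies one trunk spanning its leftmost to rightmost pin, two trunks conflict if their column intervals intersect, and a cyclic VCG raises an error instead of triggering net splitting. The small instance at the end is hypothetical and not taken from the text.

```python
def left_edge(top, bot):
    """Assign each net to a track (1 = topmost); returns {net: track}."""
    span, parents = {}, {}
    for col, (t, b) in enumerate(zip(top, bot)):
        for net in (t, b):
            if net != 0:
                lo, hi = span.get(net, (col, col))
                span[net] = (min(lo, col), max(hi, col))
                parents.setdefault(net, set())
        if t != 0 and b != 0 and t != b:
            parents[b].add(t)              # VCG edge: t must lie above b

    # Left-to-right ordering by start column (ties broken by net name).
    unassigned = sorted(span, key=lambda n: (span[n][0], n))
    track_of, track = {}, 1
    while unassigned:
        # Nets without unassigned predecessors, evaluated once per track.
        pending = set(unassigned)
        ready = [n for n in unassigned if not (parents[n] & pending)]
        if not ready:
            raise ValueError("cycle in VCG: net splitting (doglegs) required")
        right_end = -1                     # rightmost column used on this track
        for net in ready:
            lo, hi = span[net]
            if lo > right_end:             # no overlap with nets on this track
                track_of[net] = track
                right_end = hi
        unassigned = [n for n in unassigned if n not in track_of]
        track += 1
    return track_of

# Hypothetical five-column instance with three nets.
TOP = ["A", 0, "B", "C", 0]
BOT = [0, "A", "C", 0, "B"]
tracks = left_edge(TOP, BOT)
print(tracks)
```

In this instance nets A and B share the topmost track because their trunks do not overlap, while C must wait for the second track since B lies above it in column 2.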



Example: Left-Edge Algorithm 

Given: channel routing instance 

(right). 

Task: use the left-edge algorithm to 

route nets A-J in the channel. 



Q A D E 



F G D I J J 



BCECEBFHIHG 



Solution: 

currjrack= 1, VCG and zone representation: 

A D J 

\ / \ / \ 
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J 








E F 









VCG 



Zone Representation 



Nets A, D and J have no predecessors in the VCG and can be assigned to currjrack = 1 . Net 
A is assigned first because it is the leftmost. Net D can no longer be assigned to currjrack 
because of a conflict with net A. However, net J can be assigned to currjrack, since there is 
no conflict. Remove nets A and J from the VCG and the zone representation. 
currjrack = 2, VCG and zone representation: 
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VCG 



Zone Representation 



Nets D and G have no predecessors in the VCG and can be assigned to currjrack = 2. Net D 
is assigned first because it is the leftmost. Net G can no longer be assigned to currjrack 
because of a conflict with net D. Remove net D txom the VCG and the zone representation. 



currjrack = 3, VCG and zone representation: 
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VCG 



Zone Representation 



Nets E, G and I have no predecessors in the VCG and can be assigned to curr_track = 3. Net
E is assigned first because it is the leftmost. Net G is assigned to curr_track because it does
not conflict with net E. Net I cannot be assigned because of a conflict with net G. Remove
nets E and G from the VCG and the zone representation.

curr_track = 4 and curr_track = 5 are similar to previous iterations. Nets C, F and I are
assigned to track 4. Nets B and H are assigned to track 5.



Routed channel:

[Figure: routed channel with nets A-J assigned to tracks 1-5.]
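The track-by-track procedure used in this example can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `left_edge` and the list-based pin encoding (0 for an empty pin position) are assumptions made here, and the pin rows are read off the example with the last bottom entry assumed to be 0.

```python
def left_edge(top, bot):
    """Assign nets to tracks (topmost track first) with the left-edge algorithm.

    top, bot: pin rows of the channel, left to right; 0 means "no pin".
    Assumes the VCG is acyclic; raises ValueError otherwise.
    """
    nets = sorted({n for n in top + bot if n != 0})

    # Horizontal span of each net: leftmost and rightmost pin column.
    span = {}
    for col, (t, b) in enumerate(zip(top, bot)):
        for n in (t, b):
            if n != 0:
                lo, hi = span.get(n, (col, col))
                span[n] = (min(lo, col), max(hi, col))

    # VCG: a top pin t above a bottom pin b in one column means t must lie above b.
    preds = {n: set() for n in nets}
    for t, b in zip(top, bot):
        if t != 0 and b != 0 and t != b:
            preds[b].add(t)

    tracks, unrouted = [], set(nets)
    while unrouted:
        # Candidates: nets whose VCG predecessors are all routed, ordered by left edge.
        ready = sorted((n for n in unrouted if not (preds[n] & unrouted)),
                       key=lambda n: (span[n][0], n))
        if not ready:
            raise ValueError("cycle in the VCG")
        track, right_end = [], -1
        for n in ready:
            if span[n][0] > right_end:  # no column shared with nets already on this track
                track.append(n)
                right_end = span[n][1]
        tracks.append(track)
        unrouted -= set(track)
    return tracks


top = [0, "A", "D", "E", "A", "F", "G", 0, "D", "I", "J", "J"]
bot = ["B", "C", "E", "C", "E", "B", "F", "H", "I", "H", "G", 0]
print(left_edge(top, bot))
# [['A', 'J'], ['D'], ['E', 'G'], ['C', 'F', 'I'], ['B', 'H']]
```

With these pin rows the sketch reproduces the assignment derived above: A and J on track 1, D on track 2, E and G on track 3, C, F and I on track 4, and B and H on track 5.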
► 6.3.2 Dogleg Routing



To deal with cycles in the VCG, a dogleg (L-shape) can be introduced. Doglegs not
only alleviate conflicts in VCGs (Fig. 6.10), but also help reduce the total number of
tracks (Fig. 6.11).




Fig. 6.10 A dogleg is introduced to route around the conflict for net B.



The dogleg algorithm, developed by Deutsch [6.6] in the 1970s, eliminates cycles in
VCGs and reduces the number of routing tracks (Fig. 6.11). The algorithm extends
the left-edge algorithm by splitting p-pin nets (p > 2) into p - 1 horizontal segments.
Net splitting to introduce doglegs occurs only in columns that contain a pin of the
given net, under the assumption that additional vertical tracks are not available.
After net splitting, the algorithm follows the left-edge algorithm (Sec. 6.3.1). The
subnets are represented by the VCG and zone representation.
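The net-splitting step can be illustrated with a short Python sketch. The helper name `split_nets`, the 0-for-empty pin encoding and the subnet naming scheme (C1, C2, ...) are assumptions made for this illustration; a net is split only at columns where it has a pin, as described above.

```python
def split_nets(top, bot):
    """Split each net with pins in more than two columns into two-pin subnets.

    top, bot: pin rows, left to right; 0 means "no pin".
    Returns {subnet_name: (left_column, right_column)} with 0-based columns.
    """
    # Columns (left to right) in which each net has at least one pin.
    cols = {}
    for col, (t, b) in enumerate(zip(top, bot)):
        for n in (t, b):
            if n != 0 and col not in cols.setdefault(n, []):
                cols[n].append(col)

    subnets = {}
    for n, cs in cols.items():
        if len(cs) <= 2:
            subnets[n] = (cs[0], cs[-1])          # nothing to split
        else:
            for k in range(len(cs) - 1):          # p pin columns -> p - 1 segments
                subnets[f"{n}{k + 1}"] = (cs[k], cs[k + 1])
    return subnets


print(split_nets(["C", "D", 0, "D", "A", "A"], ["B", "B", "C", 0, "C", "D"]))
# {'C1': (0, 2), 'C2': (2, 4), 'B': (0, 1), 'D1': (1, 3), 'D2': (3, 5), 'A': (4, 5)}
```

Adjacent subnets of one net share their boundary column, which is where the dogleg (the vertical jog between two tracks) would be placed.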



Fig. 6.11 Example showing how net splitting can reduce the number of required tracks. (a) The
channel routing instance. (b) The conventional VCG without net splitting. (c) The channel routing
solution without net splitting uses three tracks. (d) Net splitting applied to the channel routing
solution from (c). (e) The new VCG after net splitting. (f) The new channel routing solution with
net splitting uses only two tracks.
Example: Dogleg Left-Edge Algorithm

Given: channel routing instance (right).

Column a b c d e f
TOP = [C D 0 D A A]
BOT = [B B C 0 C D]

Task: use the dogleg left-edge algorithm to route the nets A-D in the channel.

Solution:

Split the nets, find S(col) and determine the zone representation. Note: subnets of a net can be
placed on the same track regardless of overlap in the zone representation.

S(a) = {B, C1}
S(b) = {B, C1, D1}
S(c) = {C1, C2, D1}
S(d) = {C2, D1, D2}
S(e) = {A, C2, D2}
S(f) = {A, D2}

[Figure: net splitting (C into C1 and C2, D into D1 and D2) and the resulting zone representation for columns a-f.]
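As a cross-check, the sets S(col) can be computed directly from the horizontal extent of each (sub)net. The sketch below is illustrative only: the function name `zone_sets` is hypothetical, columns a-f are numbered 0-5, and the interval endpoints are an assumption read off this example.

```python
def zone_sets(intervals, num_cols):
    """Return S(col) for every column: the nets whose span covers that column."""
    return [sorted(n for n, (lo, hi) in intervals.items() if lo <= col <= hi)
            for col in range(num_cols)]


# Subnets of the example after splitting, as (leftmost column, rightmost column).
intervals = {"A": (4, 5), "B": (0, 1), "C1": (0, 2),
             "C2": (2, 4), "D1": (1, 3), "D2": (3, 5)}

s = zone_sets(intervals, 6)
for name, nets in zip("abcdef", s):
    print(f"S({name}) = {nets}")
print(max(len(nets) for nets in s))  # 3
```

The largest S(col) has three members, so at least three tracks are needed for this instance.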



Find the VCG.

[Figure: VCG over the subnets A, B, C1, C2, D1 and D2.]



Track assignment:

curr_track = 1
Consider nets C1, D1 and A. Assign net C1 first, since it is the leftmost net in the zone
representation. Of the remaining nets, only net A does not cause a conflict. Therefore, assign
net A to curr_track. Remove nets C1 and A from the VCG.

curr_track = 2
Consider nets D1, C2 and D2. Assign net D1 first, since it is the leftmost net in the zone
representation. Of the remaining nets, only net D2 does not cause a conflict. Therefore, assign
net D2 to curr_track. Remove nets D1 and D2 from the VCG.

curr_track = 3
Consider nets B and C2. Assign net B first, since it is the leftmost net in the zone
representation. Net C2 does not cause a conflict. Therefore, assign net C2 to curr_track.



Routed channel:

[Figure: routed channel with subnets assigned to tracks 1-3.]
6.4 Switchbox Routing 



Recall that switchboxes have fixed dimensions and include pin connections on all 
four sides. Switchbox routing seeks to connect all pins in each set with identical 
labels. Nets can be routed on specific horizontal tracks or vertical columns, and are 
allowed to cross when tracks and columns are on multiple layers. Compared to 
channel routing, pins on all four sides lead to a larger number of crossings and 
greater complexity, since a switchbox router cannot insert new tracks. If switchbox 
routing fails, then new tracks can be added to the switchbox and adjacent channels, 
and switchbox routing is attempted again. In this section, the algorithm descriptions 
are simplified by not accounting for routing obstacles. However, all algorithms can 
be extended accordingly. 

► 6.4.1 Terminology

A switchbox is defined as an (m + 1) x (n + 1) region with 0 ... (m + 1) columns and
0 ... (n + 1) rows. The 0th and (m + 1)th columns are the left and right borders of the
switchbox, and the 0th and (n + 1)th rows are the lower and upper borders of the
switchbox. The 1st through mth columns are labeled with lowercase letters, e.g., a
and b; the 1st through nth tracks are labeled with numbers, e.g., 1 and 2.

A switchbox is defined by four vectors LEFT, RIGHT, TOP and BOT, which define the
pin ordering on the left, right, top, and bottom borders, respectively.
Since pins are located on all four borders, the usability of these borders for routing is
severely restricted.



TOP = [0 D F H E C C]   LEFT = [A 0 D F G 0]
BOT = [0 0 G H B B H]   RIGHT = [B H A C E C]
Column a b c d e f g

Fig. 6.12 An 8 x 7 (m = 7, n = 6) switchbox routing problem (left) and a possible solution (right).
► 6.4.2 Switchbox Routing Algorithms

Algorithms for switchbox routing can be derived from channel routing algorithms.
Luk [6.14] extended a greedy channel router by Rivest and Fiduccia [6.16] to
propose a switchbox routing algorithm with the following key improvements.

1. Pin assignments are made on all four sides.

2. A horizontal track is automatically assigned to a pin on the left.

3. Jogs are used for the top and bottom pins as well as for horizontal tracks
connecting to the rightmost pins.

The performance of this algorithm is similar to that of the greedy channel router
from [6.16], but it does not guarantee full routability because the switchbox
dimensions are fixed.



Example: Switchbox Routing 

Given: the 8 x 7 (m = 7, n = 6) switchbox routing instance of Fig. 6.12.

TOP = [0 D F H E C C]   LEFT = [A 0 D F G 0]
BOT = [0 0 G H B B H]   RIGHT = [B H A C E C]

Task: route nets A-H within the switchbox.

Solution: 

Column a: Assign net A to track 2. Assign net D to track 6. Extend nets A (track 2), F (track 4),
G (track 5) and D (track 6).

Column b: Connect the top pin D to net D on track 6. Assign net G to track 1. Extend nets
G (track 1), A (track 2) and F (track 4).

Column c: Connect the top pin F to net F on track 4. Connect the bottom pin G to net G on
track 1. Assign net A to track 3. Extend net A (track 3).

Column d: Connect the bottom pin H to the top pin H. Extend nets H (track 2) and A (track 3).

Column e: Connect the bottom pin B with track 1. Connect the top pin E with track 5. Extend
nets B (track 1), H (track 2), A (track 3) and E (track 5).

Column f: Connect the bottom pin B to net B on track 1. Connect the top pin C with track 6.
Extend nets B (track 1), H (track 2), A (track 3), E (track 5) and C (track 6).

Column g: Connect the bottom pin H with net H on track 2. Connect the top pin C with net C
on track 6. Assign net C to track 4. Extend the nets on tracks 1, 2, 3, 4, 5 and 6 to their
corresponding pins.

[Figure: routed switchbox, columns a-g.]
Ousterhout et al. developed a channel and switchbox router that accounts for
obstacles such as pre-routed nets [6.15] based on the greedy channel router in [6.16].
Cohoon and Heck developed the switchbox router BEAVER, which accounts for vias
and minimizes total routing area [6.3]. It allows additional flexibility for preferred
routing directions on individual layers. BEAVER uses the following strategies - (1)
corner-routing, where a horizontal and vertical segment form a bend, (2) line-sweep
routing for simple connections and straight segments, (3) thread-routing where any
type of connection can be made, and (4) layer assignment. BEAVER outperforms
previous academic routers in terms of routing area and via count.

Another well-known switchbox router, PACKER, developed by Gerez and 
Herrmann in 1989 [6.7] has three major steps. First, each net is routed 
independently, ignoring capacity constraints. Second, any remaining conflicts are 
resolved using connectivity-preserving local transformations (CPLT). Third, the net 
segments are then locally modified (rerouted) to alleviate routing congestion. 



6.5 Over-the-Cell Routing Algorithms 



Most routing algorithms in Secs. 6.3-6.4 have dealt primarily with two-layer routing.
However, modern standard-cell designs have more than two layers, so these
algorithms must be extended accordingly. One commonly used strategy proceeds as
follows. The cells between the channels are placed back-to-back or without routing
channels. Cells predominantly use only Poly and Metal1 for internal routing. Higher
metal layers, e.g., Metal2 and Metal3, are not obstructed by standard cells and are
typically used for over-the-cell (OTC) routing. These metal layers are usually
represented by a coarse routing grid made up of gcells. The nets are globally routed
as Steiner trees (Sec. 5.6.1) and then detail-routed (Secs. 6.3-6.4).

In an alternate approach, channels are created between the cells, but are limited to
the internal cell layers such as Poly and Metal1. Routing is generally performed on
higher metal layers, such as Metal2 and Metal3. Since standard cells do not form
routing obstacles at these higher layers, the concept of a routing channel is irrelevant
in this context. Therefore, routing is performed on the entire chip area rather than in
individual channels or switchboxes (Figs. 6.13 and 6.14).




Fig. 6.13 OTC routing with two metal layers. For another example, see Fig. 1.7.



For designs with more than three layers, gcells within the layout region are 
decomposed and stretched across cell boundaries (Fig. 6.14). In addition, the power 
(VDD) and ground (GND) nets require an alternating cell orientation (Fig. 1.7). 
Fig. 6.14 Routing (metal) layers of a design partitioned into global routing cells (gcells).
Standard-cell regions in Metal1 are colored using dark gray.



OTC routing sometimes coexists with channel routing. For example, IP blocks 
typically block routing on several lowest metal layers, so the space between IP 
blocks can be broken down as channels or switchboxes. Furthermore, FPGA fabrics 
typically use very few metal layers to reduce manufacturing cost, and therefore 
cluster programmable interconnect into channels between logic elements. FPGAs 
can also include pre-designed multipliers, digital signal processing (DSP) blocks 
and memories, which use OTC routing. Recent FPGAs also include express-wires 
laid out on higher metal layers that may cross logic elements. 

► 6.5.1 OTC Routing Methodology 

OTC routing is performed in three steps - (1) select nets that will be routed outside
the channel, (2) route these nets in the OTC area, and (3) route the remaining nets
within the channel. Cong and Liu [6.4] solved steps 1 and 2 optimally in O(n^2) time,
where n is the number of nets.




[Figure labels: OTC routing in Metal3; channel routing in Metal1, Metal2 and Metal3.]

Fig. 6.15 A portion of a three-layer layout showing only the routes and pin connections (Tanner
Research, Inc.). Metal2 segments are predominantly vertical. Metal1 and Metal3 segments are
primarily horizontal; Metal3 is wider than other segments (excluding the ground stripe on the left).
Fig. 6.15 shows a fragment from a three-layer standard-cell design. The Metal3
layer is used both for horizontal tracks inside the channel and for OTC routing.
While early applications of OTC routing used three metal layers, modern ICs use six
or more metal layers.

► 6.5.2 OTC Routing Algorithms

This section discusses several advanced OTC routing algorithms. Further details can 
be found in the respective publications. The Chameleon OTC router [6.2], developed by
Braun et al., first generates topologies for nets in either two or three metal layers to
minimize the total routing area, and then assigns the net segments to tracks. The key 
innovation of Chameleon is its ability to take into account technology parameters on 
a per-layer basis. For example, designers can specify the wire widths and the 
minimum distance between each wire segment for each layer. 

Cong, Wong and Liu in [6.5] chose a different approach to OTC routing. All the nets 
are first routed in two layers, and then mapped onto three layers. Shortest-path and 
planar-routing algorithms are used to map two-layer routes to three-layer routes with 
minimum wiring area. The authors assume an H-V-H three-layer model whereby the
first and third layers host horizontal tracks, and the second layer hosts vertical tracks. 
The approach can be extended to four-layer routing. 

Ho et al. [6.9] developed another OTC routing technique that routes nets greedily,
using a set of simple heuristics which are applied iteratively. It achieved recognition
for solving the Deutsch Difficult Example [6.6] channel routing instance using 19
tracks on two metal layers, matching the size of the largest S(col) in the zone
representation.

Holmes, Sherwani and Sarrafzadeh presented WISER [6.10], which uses free pin
positions and cell area to increase the number of OTC routes (Fig. 6.16), and
carefully selects nets for OTC routing.



[Figure labels: free pin position; feedthrough passes.]

Fig. 6.16 WISER uses free pin positions to reduce the number of channel tracks from four to two
[6.10]. Feedthroughs are free vertical routing tracks from top to bottom of a cell that enable
connections between adjacent channels. When unused pin locations are located opposite each other
on the top and bottom of a channel, feedthroughs result whereby vertical Metal2 is again free to
make a connection between adjacent channels. An unused pin location means that there is no
Metal1 feature (pin) at that location inside the standard cell, and hence no need for Metal2 to
connect to any standard-cell pin. Hence, the Metal2 resource is available.
6.6 Modern Challenges in Detailed Routing

The need for low-cost, high-performance and low-power ICs has driven technology
scaling since the 1960s [6.11]. An important aspect of modern technology scaling is
the use of wires of different widths on different metal layers. In general, wider wires
on higher metal layers allow signals to travel much faster than on thinner wires on
lower metal layers. This helps to recover some benefits from scaling in terms of
performance, but at the cost of fewer routing tracks. Thicker wires are typically used
for clock (Sec. 7.5) and supply routing (Sec. 3.7), as well as for global interconnect.

Manufacturers today use different configurations of metal layers and widths to
accommodate high-performance designs. However, such a variety of routing
resources makes detailed routing more challenging. Vias connecting wires of
different widths inevitably block additional routing resources on the layer with the
smaller wire pitch. For example, layer stacks in some IBM designs for 130 nm-32
nm technologies are illustrated in Fig. 6.17 [6.1]. Wires on layers M have the
smallest possible width λ, while the wires on layers C, B, E, U and W are wider -
1.3λ, 2λ, 4λ, 10λ, and 16λ, respectively. The 90 nm technology node was the first to
introduce different metal layer thicknesses, with thicker wires on the top two layers.
Today's 32 nm metal layer stacks often incorporate four to six distinct wire
thicknesses. Advanced lithography techniques used in manufacturing lead to stricter
enforcement of preferred routing direction on each layer.



Fig. 6.17 Representative layer stacks for 130 nm
Semiconductor manufacturing yield is a key concern in detailed routing. To
safeguard against manufacturing defects, via doubling and non-tree routing insert
redundant vias and wiring segments as backups in case an electrical connection is
lost. At advanced technology nodes, manufacturability constraints (design rules)
become more restrictive and notably complicate detailed routing. For example,
design rules specify minimum allowed spacing between wires and vias depending
on their widths and proximity to wire corners. More recent spacing rules take into
account multiple neighboring polygons. Forbidden pitch rules prohibit routing wires
at certain distances apart, but allow smaller or greater spacings.

Via defects. Recall that a (single) via connects two wires on different metal layers.
However, vias can be misaligned during manufacturing, and are susceptible to
electromigration effects during the chip's lifetime [6.13]. A partially failing via with
increased resistance may cause timing violations in the circuit. A via that has failed
completely may disconnect a net, altering the circuit's function. To protect against
via failures, modern IC designs often employ double vias. Such protection requires
additional resources (area), and must obey all design rules. These resources may be
unavailable around some vias. In some congested areas, only a small subset of vias
can be doubled [6.17]. Via doubling can be performed by modern commercial
routers or by standalone yield enhancement tools after detailed routing.

Interconnect defects. The two most common manufacturing defects in wires are 
shorts (undesired connections) and opens (broken connections). To address shorts, 
adjacent wires can be spread further apart, which also decreases electromagnetic 
interference. However, spreading the wires too far can increase total wirelength, 
thereby increasing the design's exposure to opens. To address opens, non-tree 
routing [6.12] adds redundant wires to already routed nets. However, since 
increasing wirelength directly contradicts traditional routing objectives (Chaps. 5-6), 
this step is usually a post-processing step after detailed routing. Redundant wires 
increase the design's susceptibility to shorts, but make it immune to some opens. 

Antenna-induced defects. Another type of manufacturing defect affects transistors,
but can be mitigated by constraining routing topologies. It occurs after the transistor
and one or more metal layers have been fabricated, but before other layers are
completed. During plasma etching, metal wires not connected to PN-junction nodes
may collect significant electric charges which, discharged through the gate dielectric
(SiO2 at older technology nodes, high-k dielectric at newer nodes), can irreversibly
damage transistor gates. To prevent these antenna effects, detailed routers limit the
ratio of metal to gate area on each metal layer. Specifically, they restrict the area of
metal polygons connected to gates without being connected to a source/drain
implant. When such antenna rules are violated, the simplest fix is to transfer a
fraction of a route to a higher layer through a new or relocated via.
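The ratio check described above can be pictured with a small Python sketch. Everything in it is illustrative: the function name `antenna_violations`, the input layout and the limit `max_ratio` are assumptions, and actual antenna rules and limits come from the foundry's design-rule manual.

```python
def antenna_violations(nets, max_ratio):
    """Return the nets whose metal-to-gate area ratio on one layer exceeds max_ratio.

    nets maps a net name to (metal_area, gate_area), where metal_area counts only
    metal connected to a gate and not yet connected to a source/drain implant.
    """
    return sorted(name for name, (metal, gate) in nets.items()
                  if metal / gate > max_ratio)


# Hypothetical areas in arbitrary but consistent units.
layer_nets = {"n1": (400.0, 1.0), "n2": (50.0, 1.0)}
print(antenna_violations(layer_nets, max_ratio=300.0))  # ['n1']
```

A router would then repair each reported net, for example by moving part of its route to a higher layer as noted above.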

Some researchers have also proposed manufacturability-aware routers, where 
detailed routing explicitly optimizes yield. However, it is difficult to objectively 
quantify the benefit of such optimizations before manufacturing. As a result, such 
techniques have not yet caught on in the industry. 
Chapter 6 Exercises

Exercise 1: Left-Edge Algorithm

Given a channel with the following pin connections (ordered left to right).
TOP = [A B A 0 E D 0 F] and BOT = [B C D A C F E 0]

(a) Find S(col) for columns a-h and the minimum number of routing tracks.

(b) Draw the HCG and VCG.

(c) Use the left-edge algorithm to route this channel. For each track, mark the
placed nets and draw the updated VCG from (b). Draw the channel with the
fully routed nets.

Exercise 2: Dogleg Left-Edge Algorithm 

Given a channel with the following pin connections (ordered left to right).
TOP = [A A B 0 A D C E] and BOT = [0 B C A C E D D].

(a) Draw the vertical constraint graph (VCG) without splitting the nets.

(b) Determine the zone representation for nets A-E. Find S(col) for columns a-h.

(c) Draw the vertical constraint graph (VCG) with net splitting. 

(d) Find the minimum number of required tracks with net splitting and without net 
splitting. 

(e) Use the Dogleg left-edge algorithm to route this channel. For each track, state 
which nets are assigned. Draw the final routed channel. 

Exercise 3: Switchbox Routing 

Given the nets on each side of a switchbox, 

(ordered bottom-to-top) LEFT = [0 G A F B 0] RIGHT = [0 D C E G 0]
(ordered left-to-right) BOT = [0 A F G D 0] TOP = [0 A C E B D]

Route the switchbox using the approach shown in the example in Sec. 6.4.2. For 
each column, mark the routed nets and their corresponding tracks. Draw the 
switchbox with all nets routed. 

Exercise 4: Manufacturing Defects 

Consider a region with high wiring congestion and a region where routes can be 
completed easily. For each type of manufacturing defect discussed in Sec. 6.6, is it 
more likely to occur in a congested region? Explain your answers. You may find it
useful to visualize congested and uncongested regions using small examples.

Exercise 5: Modern Challenges in Detailed Routing 

Develop an algorithmic approach to double-via insertion. 

Exercise 6: Non-Tree Routing 

Discuss advantages and drawbacks of non-tree routing (Sec. 6.6). 
7 Specialized Routing

For signal wires in digital integrated circuits, global routing (Chap. 5) is performed
first, and detailed routing next (Chap. 6). However, some types of designs, such as
analog circuits and printed circuit boards (PCBs) with gridless (trackless) routing, do
not warrant this distinction. Smaller, older designs with only one or two metal layers
also fall into this category. When global and detailed routing are not performed
separately, area routing (Secs. 7.1-7.2) directly constructs metal routes for signal
connections. Unlike routing with multiple metal layers, area routing emphasizes
crossing minimization. Non-Manhattan routing is discussed in Sec. 7.3, and nets that
require special treatment, such as clock signals, are discussed in Secs. 7.4-7.5.



7.1 Introduction to Area Routing 




The goal of area routing is to route all nets in the design (1) without global routing, 
(2) within the given layout space, and (3) while meeting all geometric and electrical 
design rules. Area routing performs the following optimizations. 

— minimizing the total routed length and number of vias of all nets 

— minimizing the total area of wiring and the number of routing layers 

— minimizing the circuit delay and ensuring an even wire density 

— avoiding harmful capacitive coupling between neighboring routes 

Area routing is performed subject to technology (number of routing layers, minimal
wire width), electrical (signal integrity, coupling), and geometry (preferred routing
directions, wire pitch) constraints. Electrical and technology constraints are
traditionally represented by geometric rules, but modern routers seek to handle them
directly to improve modeling accuracy. Nevertheless, reducing total wirelength may
reduce circuit area, increase yield, and improve signal integrity. For example, the
configuration on the left is preferred because its total length is minimal (Fig. 7.1).
Fig. 7.1 Two different routing possibilities for the same two-pin net.
To measure wirelength, the (straight-line) Euclidean distance d_E and the (rectilinear)
Manhattan distance d_M are used. For two points P1 (x1, y1) and P2 (x2, y2) in the plane,
the Euclidean distance is defined as

$$d_E(P_1, P_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{(\Delta x)^2 + (\Delta y)^2}$$

and the Manhattan distance is defined as

$$d_M(P_1, P_2) = |x_2 - x_1| + |y_2 - y_1| = |\Delta x| + |\Delta y|$$
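Both distance measures are easy to compute; the following Python sketch shows them side by side for one pair of points. The function names `euclidean` and `manhattan` and the tuple representation of points are choices made for this illustration.

```python
import math


def euclidean(p1, p2):
    """Straight-line distance between points p1 = (x1, y1) and p2 = (x2, y2)."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])


def manhattan(p1, p2):
    """Rectilinear distance: |dx| + |dy|."""
    return abs(p2[0] - p1[0]) + abs(p2[1] - p1[1])


p1, p2 = (0, 0), (4, 3)
print(euclidean(p1, p2))  # 5.0
print(manhattan(p1, p2))  # 7
```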
By definition, Manhattan paths include only vertical and horizontal segments.
Analog circuits and some MCMs can use unrestricted Euclidean routing while
digital circuits use track-based Manhattan routing. The following facts and
properties are relevant to VLSI routing.



[Figure: the Euclidean shortest path and Manhattan shortest paths between two points.]

Consider all shortest paths between points P1 and P2
in the plane. The Euclidean shortest path is unique,
but there may be multiple Manhattan shortest paths.
With no obstacles, the number of Manhattan shortest
paths in a Δx × Δy region is

$$\binom{\Delta x + \Delta y}{\Delta x} = \binom{\Delta x + \Delta y}{\Delta y} = \frac{(\Delta x + \Delta y)!}{\Delta x!\,\Delta y!}$$

The example (right) has 35 paths (Δx = 4 and Δy = 3).
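The count can be verified numerically with the binomial coefficient from Python's standard library. The function name `manhattan_shortest_paths` is introduced here only for illustration.

```python
from math import comb


def manhattan_shortest_paths(dx, dy):
    """Number of obstacle-free shortest Manhattan paths across a dx-by-dy region."""
    return comb(dx + dy, dx)


print(manhattan_shortest_paths(4, 3))  # 35
```

Each shortest path is a sequence of Δx horizontal and Δy vertical unit steps, so counting paths amounts to choosing which of the Δx + Δy steps are horizontal.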



Two pairs of points may admit two non-intersecting 
Manhattan shortest paths, while their Euclidean 
shortest paths intersect. 



If all pairs of Manhattan shortest paths between two 
pairs of points intersect, then so do Euclidean shortest 
paths. 



The Manhattan distance is equal to the Euclidean distance for single horizontal and
vertical segments, but is otherwise larger.

$$\frac{d_M}{d_E} \approx \begin{cases} 1.41 & \text{worst case: a square where } \Delta x = \Delta y \\ 1.27 & \text{on average, without obstacles} \end{cases}$$
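Both constants can be reproduced with a short calculation. The sketch below assumes (this is not stated in the text) that the direction θ of the segment from P1 to P2 is uniformly distributed, with Δx = d_E cos θ and Δy = d_E sin θ.

```latex
\frac{d_M}{d_E} = |\cos\theta| + |\sin\theta|,
\qquad
\max_{\theta}\,\frac{d_M}{d_E} = \sqrt{2} \approx 1.41 \quad (\theta = 45^\circ,\ \Delta x = \Delta y),
\qquad
\frac{2}{\pi}\int_{0}^{\pi/2} (\cos\theta + \sin\theta)\,d\theta = \frac{4}{\pi} \approx 1.27
```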
7.2 Net Ordering in Area Routing

The results and runtime of area routing can be very sensitive to the order in which 
nets are routed. This is especially true for Euclidean routing, which allows fewer 
shortest paths and is therefore prone to detours. Routing multiple nets by greedily 
optimizing the wirelength of one net at a time may produce inferior configurations 
with unnecessarily large numbers of routing failures (Fig. 7.2) and total wirelength 
(Fig. 7.3). Furthermore, multi-pin nets increase the complexity of net ordering when 
they are decomposed into two-pin subnets. Therefore, some algorithms determine 
net and pin ordering before routing. 
Fig. 7.2 Effect of net ordering on routability. (a) An optimal routing of net A prevents net B from
being routed. (b) An optimal routing of net B prevents net A from being routed. (c) Nets A and B
can be simultaneously routed only if each uses more than minimum wirelength.

The choice of net and pin ordering depends on the type of the routing algorithm used. 
Pin ordering can be optimized using, e.g., (1) Steiner tree-based algorithms (Sec. 5.6) 
or other methods to decompose multi-pin nets to two-pin nets, or (2) geometric 
criteria. For example, pin locations can be ordered by (non-decreasing) x-coordinate 
and connected from left to right. Given previously-connected pins, the next pin is 
connected using a shortest-path algorithm. 
Fig. 7.3 Effect of net ordering on total wirelength, where routing net A first (left) is worse than
routing net B first (right).



For n nets, there are n! possible net orderings. In the absence of clear criteria and
polynomial-time algorithms, constructive heuristics are used. These heuristics
prioritize nets quantitatively or order them in pairs, as illustrated below. For a net net,
let MBB(net) be the minimum bounding box containing the pin locations of net, let
AR(net) be the aspect ratio of MBB(net), and let L(net) be the length of net.



Rule 1: For two nets i and j, if AR(i) > AR(j), then i is routed before j (Fig. 7.4).
Rationale: nets with square MBBs tend to have greater routing flexibility than nets
with tall or wide bounding boxes (all straight nets have AR = ∞). If AR(i) = AR(j),
then ties can be broken by net length, i.e., if L(i) < L(j), then i is routed before j.
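Rule 1 amounts to a sort with a two-part key. The Python sketch below is a minimal illustration under two assumptions that the text does not spell out: AR is taken as the longer MBB side divided by the shorter one (so straight nets get AR = ∞), and L(net) is approximated by the half-perimeter of the MBB. All function names are hypothetical.

```python
import math


def mbb(pins):
    """Minimum bounding box (x_min, y_min, x_max, y_max) of a list of (x, y) pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return min(xs), min(ys), max(xs), max(ys)


def aspect_ratio(pins):
    """Longer MBB side over shorter side; infinite for straight nets (assumption)."""
    x0, y0, x1, y1 = mbb(pins)
    short, long_side = sorted((x1 - x0, y1 - y0))
    return math.inf if short == 0 else long_side / short


def net_length(pins):
    """Half-perimeter of the MBB, used here as a stand-in for L(net)."""
    x0, y0, x1, y1 = mbb(pins)
    return (x1 - x0) + (y1 - y0)


def order_by_rule1(nets):
    """Larger aspect ratio first; ties broken by shorter net length."""
    return sorted(nets, key=lambda n: (-aspect_ratio(nets[n]), net_length(nets[n])))


nets = {"A": [(0, 0), (5, 0)], "B": [(1, 1), (3, 3)], "C": [(0, 2), (6, 4)]}
print(order_by_rule1(nets))  # ['A', 'C', 'B']
```

Net A is straight (AR = ∞) and goes first, followed by C (AR = 3) and the square-MBB net B (AR = 1).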
Fig. 7.4 Net ordering based on the aspect ratio of the nets' bounding boxes. Net A has higher aspect
ratio (AR(A) = ∞) and results in shorter total wirelength (left), while routing net B first results in
greater total wirelength.



Rule 2: For two nets i and j, if the pins of i are contained within MBB(j), then i is
routed before j (Fig. 7.5). Ties can be broken by the number of routed nets not fully
contained in the MBB.
Fig. 7.5 Net ordering based on the pin locations inside the bounding boxes of the nets. The
first-routed net has no pins contained within its MBB. Starting with net D, there are two potential
net orderings: D-A-C-B and D-C-A-B. (a) Nets A-D with their MBBs. (b) Routing net D first. (c)
Routing net C and then net A or net A and then net C. (d) Routing net B.

Rule 3: Let Π(net) be the number of pins within MBB(net) for net net. For two nets i
and j, if Π(i) < Π(j), then i is routed before j. That is, the net that is routed first has
the smaller number of pins from other nets within its bounding box (Fig. 7.6). Ties
are broken based on the number of pins that are contained within the bounding box.
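Rule 3 can be sketched in Python as follows. The function names are hypothetical, pins on the MBB border are counted as inside (an assumption), and ties are broken alphabetically here purely to keep the output deterministic, which simplifies the tie-breaking described above.

```python
def pins_inside(net, nets):
    """Count pins of other nets that lie within MBB(net), borders included."""
    xs = [x for x, _ in nets[net]]
    ys = [y for _, y in nets[net]]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return sum(1
               for other, pins in nets.items() if other != net
               for x, y in pins
               if x0 <= x <= x1 and y0 <= y <= y1)


def order_by_rule3(nets):
    """Nets with fewer foreign pins inside their MBB are routed first."""
    return sorted(nets, key=lambda n: (pins_inside(n, nets), n))


nets = {"A": [(0, 0), (10, 10)], "B": [(2, 2), (4, 4)], "C": [(3, 3), (8, 8)]}
print(order_by_rule3(nets))  # ['B', 'C', 'A']
```

Here MBB(B) and MBB(C) each contain one foreign pin, while MBB(A) contains all four pins of B and C, so A is routed last.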
Fig. 7.6 Finding the net ordering based on the number of pins of nets within its MBB. (a) Nets A-E
with their MBBs. (b) Net D is routed first because its MBB contains no pins. (c) Net C is routed
because it contains one pin, and net E is routed next because it contains two pins. (d) Nets B and A
are routed next, so the result is D-C-E-B-A. Note that this example cannot be routed with sequential
net ordering on a Manhattan grid.
7.3 Non-Manhattan Routing

Recall from Sec. 7.1 that traditional Manhattan routing allows only vertical and
horizontal segments. Shorter paths are possible with diagonal segments. However,
arbitrary diagonal segments cannot be effectively manufactured. A possible
compromise is to allow 45-degree or 60-degree segments in addition to horizontal
and vertical segments. Such non-orthogonal routing configurations are commonly
described by λ-geometry, where λ represents the number of possible routing
directions¹ and the angles π/λ at which they can be oriented.

— λ = 2 (90 degrees): Manhattan routing (four routing directions)

— λ = 3 (60 degrees): Y-routing (six routing directions)

— λ = 4 (45 degrees): X-routing (eight routing directions)
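The three cases listed above follow one pattern: the allowed directions are the multiples of 180/λ degrees (π/λ radians), which gives 2λ directions in total. A minimal Python sketch, with the hypothetical function name `routing_directions_deg`:

```python
def routing_directions_deg(lam):
    """Angles in degrees of the 2 * lam routing directions in lambda-geometry."""
    return [180 * k / lam for k in range(2 * lam)]


print(routing_directions_deg(2))  # [0.0, 90.0, 180.0, 270.0]
print(routing_directions_deg(3))  # [0.0, 60.0, 120.0, 180.0, 240.0, 300.0]
print(routing_directions_deg(4))  # [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]
```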

The advantages of the latter two routing styles over Manhattan-based routing are
decreased wirelength and via count. However, other steps in the physical design
flow, such as physical verification, could take significantly longer. Additionally,
non-Manhattan routing becomes prohibitively difficult at recent technology nodes
due to limitations of optical lithography. Therefore, non-Manhattan routing is
primarily employed on printed circuit boards (PCBs). This is illustrated by octilinear
route planning in Sec. 7.3.1 and eight-directional path search in Sec. 7.3.2.

► 7.3.1 Octilinear Steiner Trees 

Octilinear Steiner minimum trees (OSMT) generalize rectilinear Steiner trees by
allowing segments that extend in eight directions. The inclusion of diagonal
segments gives more freedom when placing Steiner points, which may reduce total
net length. Several OSMT algorithms have been proposed, such as in [7.9] and
[7.19]. The following approach was developed by Ho et al. [7.9] (refer to the
pseudocode on the next page).

First, find the shortest three-pin subnets of the net under consideration. To identify
these three-pin groups, the Delaunay triangulation² is found over all pins (line 2).
Second, sort all the groups in ascending order of their minimum octilinear routed
lengths (line 3). Then, integrate these three-pin subnets into the overall OSMT. For
each group subT in sorted order (line 4), (1) route subT with the minimum octilinear
length (line 5), (2) merge subT with the current octilinear Steiner tree OST (line 6),
and (3) locally optimize OST based on subT (line 7).



¹ Not to be confused with the layout-scaling parameter λ.
² The Delaunay triangulation for a set of points P in a plane is a triangulation DT(P) such that no
points in P lie inside the circumcircle of any triangle in DT(P). The circumcircle of a triangle tri is
defined as a circle which passes through all the vertices of tri.

Octilinear Steiner Tree Algorithm [7.9]

Input: set of all pins P and their coordinates

Output: heuristic octilinear minimum Steiner tree OST

1. OST = Ø

2. T = set of all three-pin nets of P found by Delaunay triangulation

3. sortedT = SORT(T, minimum octilinear distance)

4. for (i = 1 to |sortedT|)

5.   subT = ROUTE(sortedT[i])   // route minimum tree over subT

6.   ADD(OST, subT)             // add route to existing tree

7.   IMPROVE(OST, subT)         // locally improve OST based on subT



Example: Octilinear Steiner Tree 

Given: pins P1-P12.

P1 (2,17)   P2 (1,14)   P3 (11,15)   P4 (4,11)
P5 (14,12)  P6 (2,9)    P7 (11,9)    P8 (12,6)
P9 (16,6)   P10 (7,4)   P11 (3,1)    P12 (14,1)

Task: find a heuristic octilinear Steiner minimum tree. 



Solution: 

Find all three-pin subnets using Delaunay triangulation. Find 
the minimum octilinear Steiner tree cost for each subnet and 
sort the subnets in ascending order of this cost. 



L(T2,4,6) = 7.0       L(T7,8,9) = 7.4      L(T1,2,4) = 7.6
L(T3,5,7) = 8.4       L(T8,9,12) = 8.6     L(T7,8,10) = 8.6
L(T4,6,10) = 9.6      L(T5,7,9) = 9.6      L(T8,10,12) = 10.6
L(T6,10,11) = 11.8    L(T3,4,7) = 12.6     L(T4,7,10) = 12.6
L(T10,11,12) = 13.4   L(T1,3,4) = 13.8



Add T2,4,6 to OST. No optimization is necessary because it is
the first merged subtree. This is also the case for trees T7,8,9,
T1,2,4, and T3,5,7. Merging T8,9,12 causes a cycle between the
new Steiner point in T8,9,12, P8 and P9.



To resolve this conflict, find the minimum octilinear distance
needed to connect the three new points. The minimal tree uses two
diagonal segments of total length $2 \cdot \sqrt{2^2 + 2^2} \approx 5.7$, which
results in smaller wirelength than the horizontal segment
connecting P8 and P9 plus a vertical segment, 4 + 2 = 6.

Continue merging the remaining subtrees. The final heuristic
octilinear minimum Steiner tree is constructed after merging
all subtrees.

► 7.3.2 Octilinear Maze Search

Route planning based on OSMTs assumes octilinear detailed routing. One approach
is based on Lee's wave-propagation algorithm [7.13] where, instead of only
expanding nodes in four directions, eight directions are used, including diagonals.
The wave propagation begins at s and marks all previously-unvisited neighbors with
index 1 (Fig. 7.7(a)). From each node with index 1, the propagation expands again,
and each previously-unvisited neighbor is marked with index 2 (Fig. 7.7(b)). Such
iterations continue until the wave reaches the target t or no further propagation is
possible. After t is reached, a path is backtraced by stepping to the next-smallest
index until s is reached (Fig. 7.7(c)).

Fig. 7.7 Octilinear maze search with Lee's algorithm. (a) The initial outward pass from the source
node s. (b) The second pass, branching out in all eight directions. (c) The third pass, where the
target node t has been found, and a path is traced back from t to s.
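
The wave propagation and backtrace described above can be sketched as a breadth-first search over a grid. This is a minimal illustration under stated assumptions: the grid encoding (True marks a blocked cell), the function name, and the choice to let diagonal moves pass between two blocked corner cells are not specified in the text.

```python
from collections import deque

# Eight neighbor offsets: four orthogonal moves plus four diagonal moves.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def octilinear_lee(grid, s, t):
    """Return a list of (row, col) cells from s to t, or None if unreachable.

    grid[r][c] is True for a blocked cell (an assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    index = {s: 0}                      # wave index of every visited cell
    frontier = deque([s])
    while frontier and t not in index:
        cell = frontier.popleft()
        for dr, dc in DIRECTIONS:
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and not grid[nxt[0]][nxt[1]] and nxt not in index):
                index[nxt] = index[cell] + 1
                frontier.append(nxt)
    if t not in index:
        return None
    # Backtrace: repeatedly step to a neighbor whose index is one smaller.
    path = [t]
    while path[-1] != s:
        cur = path[-1]
        for dr, dc in DIRECTIONS:
            prev = (cur[0] + dr, cur[1] + dc)
            if index.get(prev) == index[cur] - 1:
                path.append(prev)
                break
    path.reverse()
    return path
```

Note that the wave index counts expansion steps, so a diagonal step and an orthogonal step both advance the index by one, as in the description above.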
7.4



Most digital designs are synchronous - computation progresses when current values
of internal state variables and input variables are fed to combinational logic
networks, which then generate outputs as well as the next values of the state
variables. A clock signal or "heartbeat", required to maintain synchronization of all
computation that takes place across the chip, may be generated off-chip, or by
special analog circuitry such as phase-locked loops or delay-locked loops. Its
frequency may be divided or multiplied depending on the needs of individual
blocks. Once the clock signal entry points and sinks (flip-flops and latches) are
known, clock tree routing generates a clock tree for each clock domain of the circuit.

The special role of the clock signal in synchronizing all computations on the chip 
makes clock routing very different from the other types of routing (Chaps. 5-6). The 
crux of the clock routing problem is that the signal must be delivered from the 
source to all the destinations, or sinks, at the same time. Advanced algorithms for 
clocking can be found in Chaps. 42-43 of [7.1] and in books [7.15] and [7.18].

► 7.4.1 Terminology 

A clock routing problem instance (clock net) is represented by n + 1 terminals,
where $s_0$ is designated as the source, and $S = \{s_1, s_2, \ldots, s_n\}$ is designated as sinks.
Let $s_i$, $0 \le i \le n$, denote both a terminal and its location.

A clock routing solution consists of a set of wire segments that connect all of the 
terminals of the clock net, so that a signal generated at the source can be 
propagated to all of the sinks. The clock routing solution has two aspects - its 
topology and its embedding. 

The clock tree topology (clock tree) is a rooted binary tree G with n leaves
corresponding to the set of sinks. Internal nodes of the topology correspond to the 
source and any Steiner points in the clock routing. 

The embedding of a given clock tree topology provides exact physical locations of 
the edges and internal nodes of the topology. Fig. 7.8(a) shows a six-sink clock 
tree instance. Fig. 7.8(b) shows a connection topology, and Fig. 7.8(c) shows a 
possible clock tree solution that embeds that topology.

Fig. 7.8 (a) A clock tree routing instance. (b) Connection topology. (c) Embedding.



Assume that the clock tree T is oriented such that the source terminal $s_0$ is the root
of the tree. Each node $v \in T$ is connected to its parent $p \in T$ by edge $(p,v)$. The
cost of each edge $e \in T$ is its wirelength, denoted by $|e|$. Let the cost of T, denoted
by cost(T), be the sum of its (embedded) edge costs.

Signal delay is the time required for a signal transition (low to high, or high to
low) to propagate from one node to another node, e.g., in a routing tree. Signal 
transitions are initiated at the outputs of logic gates, which are constructed from 
transistors that have highly nonlinear behavior. The transitions propagate through 
complex wire and via structures that have parasitic resistances, capacitances and 
inductances. Hence, it is difficult to exactly calculate signal delay. Circuit 
simulators such as SPICE, or commercial timing analysis tools such as 
PrimeTime, are used to obtain accurate "signoff delay" calculations during the 
final checking steps before the design is sent to production. However, to guide 
place-and-route algorithms, considerably less accuracy is needed. Two common 
signal delay estimates used in timing-driven routing are the linear and Elmore 
delay models. The following is a reproduction of the development given in [7.11].

In the linear delay model, signal delay from $s_i$ to $s_j$ is proportional to the length of
the $s_i \sim s_j$ path in the routing tree and is independent of the rest of the connection
topology. Thus, the normalized linear delay between any two nodes u and w in a
source-sink path is the sum of the edge lengths $|e|$ in the $u \sim w$ path:

$$t_{LD}(u,w) = \sum_{e \in (u \sim w)} |e|$$



On-chip wires are passive, resistive-capacitive (RC) structures, for which both 
resistance (R) and capacitance (C) typically grow in proportion to the length of the
wire. Thus, the linear delay model does not accurately capture the "quadratically 
growing" RC component of wire delay. On the other hand, the linear 
approximation provides reasonable guidance to design tools, especially for older 
technologies that have smaller drive resistance of transistors and larger wire 
widths (smaller wire resistances). In practice, the linear delay model is very 
convenient to use in EDA software tools because of its ease of evaluation. 

In the Elmore delay model, given the routing tree T with root (source) node $s_0$,

— $(p,v)$ denotes the edge connecting node v to its parent node p in T

— $R(e)$ and $C(e)$ denote the respective resistance and capacitance of edge $e \in T$

— $T_v$ denotes the subtree of T rooted at v

— $C(v)$ denotes the sink capacitance of v

— $C(T_v)$ denotes the tree capacitance of $T_v$, i.e., the sum of sink and edge
capacitances in $T_v$.

If node v is a terminal, then $C(v)$ is typically the capacitance of the input pin to
which the clock signal is routed. If node v is a Steiner node, then $C(v) = 0$. If $T_v$ is
a single (leaf) node, $C(T_v)$ is equal to v's sink capacitance $C(v)$.

Using this notation, the Elmore delay approximation for an edge $(p,v)$ is

$$t_{ED}(p,v) = R(p,v) \cdot \left( \frac{C(p,v)}{2} + C(T_v) \right)$$

This can be seen as a sum of RC delay products, with the factor of one-half
corresponding to a ~63% threshold delay. Last, if $R_d$ denotes the on-resistance of
the output transistor at the source ("stronger" driving gates will have smaller
on-resistance values $R_d$), then the Elmore delay $t_{ED}(s)$ for sink s is

$$t_{ED}(s) = R_d \cdot C(T_{s_0}) + \sum_{e \in (s_0 \sim s)} t_{ED}(e)$$

Physical design tools use the Elmore delay approximation for three main reasons. 
First, it accounts for the sink delay impact of off-path wire capacitance - the edges 
of the routing tree that are not directly on the source-to-sink path. Second, it offers 
reasonable accuracy and good fidelity (correlation) with respect to accurate delay 
estimates from circuit simulators. Third, it can be evaluated at all nodes of a tree 
in time that is linear in tree size (number of edges). This is realized by two 
depth-first traversals: the first calculates the tree capacitance $C(T_v)$ below each
node in the tree, while the second calculates the delays from the source to each 
node [7.11]. 
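
The two-traversal evaluation described above can be sketched as follows. The data layout (dictionaries keyed by node name) and the function name are assumptions made for illustration; each edge contributes its resistance times half its own capacitance plus the tree capacitance below it, and the driver resistance sees the capacitance of the whole tree.

```python
def elmore_delays(children, edge_rc, sink_cap, root, r_driver):
    """Return {node: Elmore delay from the source} for every node of a tree.

    children[v]  -> list of child nodes of v (missing key means no children)
    edge_rc[v]   -> (R, C) of the edge from v's parent to v
    sink_cap[v]  -> C(v); missing entries (Steiner nodes) count as 0
    r_driver     -> on-resistance of the driver at the root
    """
    # Depth-first pre-order: every node appears after its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    # Traversal 1 (children before parents): tree capacitance below each node.
    tree_cap = {}
    for v in reversed(order):
        cap = sink_cap.get(v, 0.0)
        for c in children.get(v, []):
            cap += edge_rc[c][1] + tree_cap[c]
        tree_cap[v] = cap

    # Traversal 2 (parents before children): accumulate edge delays.
    delay = {root: r_driver * tree_cap[root]}
    for v in order:
        for c in children.get(v, []):
            r, cap = edge_rc[c]
            delay[c] = delay[v] + r * (cap / 2 + tree_cap[c])
    return delay

# Hypothetical example: source s0, Steiner node u, sinks a and b.
example = elmore_delays(
    children={"s0": ["u"], "u": ["a", "b"]},
    edge_rc={"u": (1.0, 2.0), "a": (2.0, 2.0), "b": (1.0, 4.0)},
    sink_cap={"a": 1.0, "b": 1.0},
    root="s0",
    r_driver=1.0,
)
# example["a"] is 23.0 and example["b"] is 22.0 in these arbitrary units.
```

Both passes visit each edge once, which is where the linear running time comes from.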

Clock skew is the (maximum) difference in clock signal arrival times between 
sinks. This parameter of the clock tree solution is important, since the clock signal 
must be delivered to all sinks at the same time. If $t(u,v)$ denotes the signal delay
between nodes u and v, then the skew of clock tree T is

$$skew(T) = \max_{s_i, s_j \in S} |t(s_0,s_i) - t(s_0,s_j)|$$

If there exists a path of combinational logic from the (data) output pin of one sink 
to the (data) input pin of another sink, then the two sinks are said to be related or 
sequentially adjacent. Otherwise, the two sinks are unrelated. 

Local skew is the maximum difference in arrival times of the clock signal at the 
clock pins of two or more related sinks. 

Global skew is the maximum difference in arrival times of the clock signal at the 
clock pins of any two (related or unrelated) sinks - i.e., the difference between 
shortest and longest source-sink path delays in the clock distribution network. In 
practice, skew typically refers to global skew. 
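
As a small illustration of the difference between the two definitions, the sketch below computes both skews from a table of clock arrival times. The function names and the representation of related sinks as a list of pairs are assumptions made for this example.

```python
def global_skew(arrival):
    """Largest difference in clock arrival time over all pairs of sinks."""
    times = list(arrival.values())
    return max(times) - min(times)

def local_skew(arrival, related_pairs):
    """Largest arrival-time difference over pairs of related sinks only."""
    return max(abs(arrival[i] - arrival[j]) for i, j in related_pairs)
```

With hypothetical arrival times of 10, 12 and 15 at sinks s1, s2 and s3, where only s1 and s2 are related, the global skew is 5 but the local skew is 2.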
► 7.4.2 Problem Formulations for Clock-Tree Routing

This section presents some basic clock routing formulations. The most fundamental 
is the zero-skew tree problem. Practical variations include the bounded-skew tree 
and useful-skew tree problems. The integration of zero-skew trees in a modern,
low-power clock-network design flow is further discussed in Sec. 7.5.2, with more 
details in [7.14]. It relies on SPICE - software for circuit simulation - for a high 
degree of accuracy. 

Zero skew. If a clock tree exhibits zero skew, then it is a zero-skew tree (ZST). For 
skew to be well-defined, a delay estimate (e.g., linear or Elmore delay) is implicit.

Zero-Skew Tree (ZST) Problem. Given a set S of sink locations, construct a ZST 
T(S) with minimum cost. In some contexts, a connection topology G is also given.

Bounded skew. While the ZST problem leads to elegant physical design 
algorithms that form the basis of commercial solutions, practical clock tree routing 
does not typically achieve exact zero skew. 

In practice, a "true ZST" is not desirable. ZSTs can use a significant amount of 
wirelength, increasing the total capacitance of the network. Moreover, a true ZST 
is also not achievable in practice - manufacturing variability for both transistors 
and interconnects can cause differences in the RC constants of wire segments of a 
given layer. Thus, signoff timing analysis is with respect to a non-zero skew bound 
that must be achieved by the clock routing tool. 

Bounded-Skew Tree (BST) Problem. Given a set S of sink locations and a skew
bound UB > 0, construct a clock tree T(S) with skew(T(S)) ≤ UB having minimum
cost. As with the ZST problem, in certain contexts a topology G may be specified.
Notice that when the skew is unbounded (UB = ∞), the BST problem becomes the
classic RSMT problem (Chap. 5).

Useful skew. Clock trees do not always require bounded global skew. Correct chip 
timing only requires control of the local skews between pairs of related flip-flops 
or latches. While the clock tree routing problem can be conveniently formulated in 
terms of global skew, this actually over-constrains the problem. The increasingly 
prominent useful skew formulation is based on analysis of local skew constraints. 

In synchronous circuits, the data signal that propagates from a flip-flop (sink) 
output to the next flip-flop input should arrive neither too late nor too early. The 
former failure mode (late arrival) is zero clocking, while the latter (early arrival) is 
double clocking [7.7]. In contrast to formulations that minimize or bound global 
skew, Fishburn [7.7] proposed a clock skew optimization method that introduces 
useful skew - perturbing clock arrival times at sinks - in the clock tree to either 
minimize the clock period or maximize the clock safety margin. The clock period 
P can be reduced by appropriate choices of sink arrival times (Fig. 7.9).

[Figure 7.9, panel (b): Minimum clock period P = 4 ns with 2 ns (useful) skew, since
P ≥ 2 ns − (0 ns − 2 ns) = 4 ns and P ≥ 6 ns − (2 ns − 0 ns) = 4 ns.]

Fig. 7.9 Example of useful skew for clock cycle time reduction. (a) Zero skew results in a 6 ns
clock period. (b) Useful skews of 2 ns, 0 ns and 2 ns at $x_1$, $x_2$ and $x_3$ result in a 4 ns clock period.

To avoid zero clocking, the data edge generated by $FF_i$ due to a clock edge must
arrive at $FF_j$ no later than $t_{setup}$ before the earliest arrival of the next clock edge.
Formally, $x_i + t_{setup} + \max(i,j) \le x_j + P$ must be met with clock period P, where

— $x_i$ is the latest time at which the clock edge can arrive at $FF_i$

— $\max(i,j)$ is the slowest (longest) signal propagation from $FF_i$ to $FF_j$

— $x_j + P$ is the earliest arrival time of the next clock edge at $FF_j$
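
The setup condition above directly gives the smallest feasible clock period for a fixed set of clock arrival times. The sketch below evaluates it; the function name is hypothetical, and the path delays of 2 ns and 6 ns and the zero setup time are assumptions chosen so that the results match the 6 ns and 4 ns periods quoted for Fig. 7.9.

```python
def min_clock_period(x, max_delay, t_setup):
    """Smallest P with x[i] + t_setup + max_delay[i, j] <= x[j] + P for all pairs.

    x[i]            -> clock arrival time at flip-flop i
    max_delay[i, j] -> longest data propagation time from flip-flop i to j
    """
    return max(x[i] + t_setup + d - x[j] for (i, j), d in max_delay.items())

# Assumed path delays (ns): FF1 -> FF2 takes 2, FF2 -> FF3 takes 6.
paths = {(1, 2): 2, (2, 3): 6}
zero_skew_period = min_clock_period({1: 0, 2: 0, 3: 0}, paths, t_setup=0)
useful_skew_period = min_clock_period({1: 2, 2: 0, 3: 2}, paths, t_setup=0)
# zero_skew_period is 6 and useful_skew_period is 4.
```

Delaying the clock at the first and third flip-flops by 2 ns lends time to the slow path and takes it from the fast one, which is why the period drops from 6 ns to 4 ns.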

To avoid double clocking between two flip-flops $FF_i$ and $FF_j$, the data edge
generated at $FF_i$ due to a clock edge must arrive at $FF_j$ no sooner than $t_{hold}$ after
the latest possible arrival of the same clock edge. Formally, $x_i + \min(i,j) \ge x_j + t_{hold}$
must be met, where

— $x_i$ is the earliest time at which the clock edge can arrive at $FF_i$

— $\min(i,j)$ is the fastest (shortest) signal propagation from $FF_i$ to $FF_j$

— $x_j$ is the latest arrival time of the clock at $FF_j$

Fishburn observed that linear programming can be used to find optimal clock arrival
times $x_i$ at all sinks to either (1) minimize clock period (LP_SPEED), or (2)
maximize the safety margin (LP_SAFETY).

Useful Skew Problem (LP_SPEED). Given (1) constant values of $t_{setup}$ and $t_{hold}$, (2)
maximum and minimum signal propagation times max(i,j) and min(i,j) between all
pairs (i,j) of related sinks, and (3) minimum source-sink delay $t_{min}$, determine clock
arrival times $x_i$ for all sinks to minimize clock period P, subject to the following
constraints.

$x_i - x_j \ge t_{hold} - \min(i,j)$           for all related (i,j)
$x_j - x_i + P \ge t_{setup} + \max(i,j)$      for all related (i,j)
$x_i \ge t_{min}$                              for all i

Useful Skew Problem (LP_SAFETY). Given (1) constant values of $t_{setup}$ and $t_{hold}$, (2)
maximum and minimum signal propagation times max(i,j) and min(i,j) between all
pairs (i,j) of related sinks, and (3) minimum source-sink delay $t_{min}$, determine clock
arrival times $x_i$ for all sinks to maximize safety margin SM, subject to

$x_i - x_j - SM \ge t_{hold} - \min(i,j)$          for all related (i,j)
$x_j - x_i - SM \ge t_{setup} + \max(i,j) - P$     for all related (i,j)
$x_i \ge t_{min}$                                  for all i

Clock trees play a vital role in modern synchronous designs and significantly impact
the circuit's performance and power consumption. A clock tree should have low 
skew, simultaneously delivering the same signal to every sequential gate. After the 
initial tree construction (Sec. 7.5.1), the clock tree undergoes clock buffer insertion 
and several subsequent skew optimizations (Sec. 7.5.2). 

► 7.5.1 Constructing Trees with Zero Global Skew

This section presents five early algorithms for clock tree construction whose 
underlying concepts are still used in today's commercial EDA tools. Several 
scenarios are covered, including algorithms that (1) construct a clock tree 
independent of the clock sink locations, (2) construct the clock tree topology and 
embedding simultaneously, and (3) construct only the embedding given a clock tree 
topology as input. 

H-tree. The H-tree is a self-similar, fractal structure (Fig. 7.10) with exact zero skew 
due to its symmetry. It was first popularized by Bakoglu [7.2]. In the unit square, a 
segment is passed through the root node at center, then two shorter line segments are 
constructed at right angles to the first segment, to the centers of the four quadrants; 
this process continues recursively until the sinks are reached. The H-tree is 
frequently used for top-level clock distribution, but cannot be employed directly for 
the entire clock tree due to (1) blockages, (2) irregularly placed clock sinks, and (3) 
excessive routing cost. That is, to reach all $n = 4^k$ sinks uniformly located in the unit
square, where $k \ge 1$ is the number of levels in the H-tree, the wirelength of the
H-tree grows as $3\sqrt{n}/2$. To minimize signal reflections at branching points, the
wire segments can be tapered - halving the wire width at each branching point
encountered as one moves away from the source.
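
The recursive construction and its wirelength growth can be checked with a short sketch. The function names and the segment representation are assumptions made for illustration. Summing the levels of this construction in the unit square gives a total of exactly (3/2)(√n − 1), which approaches 3√n/2 for large n.

```python
def h_tree_segments(cx, cy, half, levels):
    """Return the wire segments of an H-tree centered at (cx, cy).

    half is half the side length of the square being served, and levels is
    the number of nested H shapes still to be generated.
    """
    if levels == 0:
        return []
    q = half / 2
    segments = [((cx - q, cy), (cx + q, cy))]          # horizontal bar
    for x in (cx - q, cx + q):
        segments.append(((x, cy - q), (x, cy + q)))    # vertical bar
        for y in (cy - q, cy + q):                     # quadrant centers
            segments += h_tree_segments(x, y, q, levels - 1)
    return segments

def wirelength(segments):
    """Total length of axis-parallel segments."""
    return sum(abs(x1 - x2) + abs(y1 - y2) for (x1, y1), (x2, y2) in segments)
```

For example, `wirelength(h_tree_segments(0.5, 0.5, 0.5, 2))` covers n = 16 sinks and evaluates to 4.5, compared with 3√16/2 = 6 from the asymptotic expression.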
Fig. 7.10 An H-tree.



Method of Means and Medians (MMM). The method of means and medians
(MMM) was proposed by Jackson, Srinivasan and Kuh in 1990 [7.10] to overcome
the topology limitations of the H-tree. MMM is applicable even when the clock
terminals are arbitrarily arranged. The basic idea is to recursively partition the set of
terminals into two subsets of equal cardinality (median). Then, the center of mass of
the set is connected to the centers of mass of the two subsets (mean) (Fig. 7.11). The
basic MMM algorithm is described below.



Basic Method of Means and Medians (BASIC_MMM(S,T))

Input: set of sinks S, empty tree T

Output: clock tree T

1.  if (|S| ≤ 1)
2.    return
3.  (x0,y0) = (xc(S),yc(S))          // center of mass for S
4.  (S_A,S_B) = PARTITION(S)         // median to determine S_A and S_B
5.  (xA,yA) = (xc(S_A),yc(S_A))      // center of mass for S_A
6.  (xB,yB) = (xc(S_B),yc(S_B))      // center of mass for S_B
7.  ROUTE(T,x0,y0,xA,yA)             // connect center of mass of S to
8.  ROUTE(T,x0,y0,xB,yB)             // center of mass of S_A and S_B
9.  BASIC_MMM(S_A,T)                 // recursively route S_A
10. BASIC_MMM(S_B,T)                 // recursively route S_B

Let $(x_c(S), y_c(S))$ denote the x- and y-coordinates for the center of mass of the point
set S, defined as

$$x_c(S) = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad y_c(S) = \frac{\sum_{i=1}^{n} y_i}{n}$$

$S_A$ and $S_B$ are the two equal-cardinality subsets obtained when S is partitioned by a
median.
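
The pseudocode above can be turned into a short runnable sketch. Two choices here are assumptions rather than part of the algorithm as stated: the cut direction alternates between x and y at each level, and the tree is stored as a plain list of (parent point, child point) edges.

```python
def center_of_mass(points):
    """Mean x- and y-coordinate of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def basic_mmm(sinks, tree, cut_x=True):
    """Append the edges of an MMM clock tree over sinks to the list tree."""
    if len(sinks) <= 1:
        return
    axis = 0 if cut_x else 1
    ordered = sorted(sinks, key=lambda p: p[axis])
    mid = len(ordered) // 2
    s_a, s_b = ordered[:mid], ordered[mid:]       # split at the median
    c0 = center_of_mass(sinks)
    for subset in (s_a, s_b):
        tree.append((c0, center_of_mass(subset)))  # connect the two centers
        basic_mmm(subset, tree, not cut_x)         # recurse with the other cut
```

For the four sinks (0,0), (0,2), (4,0) and (4,2), the first edge connects the overall center (2,1) to the center (0,1) of the left pair, and the complete tree has six edges.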



While the MMM strategy may be simplified to a top-down H-tree construction,
the clock skew is minimized only heuristically. The maximum difference between
two source-sink pathlengths can be very large (up to the diameter of the chip) in the
worst case. Thus, the algorithm's effectiveness depends heavily on the choice of
cut directions for median computation.

Fig. 7.11 Illustration of the main steps of the method of means and medians (MMM). (a) Find the
center of mass (white 'o') of the point set S (black 'o'). (b) Partition S by the median. (c) Find the
center of mass for the left and right subsets of S. (d) Route (connect) the center of mass of S with
the centers of mass of the left and right subsets of S. (e) Final result after recursively performing
MMM on each subset.

Recursive Geometric Matching (RGM). The recursive geometric matching
(RGM) algorithm [7.12] was proposed in 1991. Whereas MMM is a top-down
algorithm, RGM proceeds in a bottom-up fashion. The basic idea is to recursively
find a minimum-cost geometric matching of n sinks - a set of n / 2 line segments
that connect n endpoints pairwise, with no two line segments sharing an endpoint
and minimal total segment length. After each matching step, a balance or tapping
point is found on each matching segment to preserve zero skew to the associated
sinks. The set of n / 2 tapping points then forms the input to the next matching step.
The algorithm is illustrated in Fig. 7.12.

Formally, let T denote a rooted binary tree, let S denote a set of points, and let M
denote a set of matching pairs <Pi,Pj> over S. The clock entry point (CEP) denotes
the root of the clock tree, i.e., the location from which the clock signal is propagated.



Recursive Geometric Matching Algorithm
Input: set of sinks S, empty tree T
Output: clock tree T

1.  if (|S| ≤ 1)
2.    return
3.  M = min-cost geometric matching over S
4.  S' = Ø
5.  foreach (<Pi,Pj> ∈ M)
6.    T_Pi = subtree of T rooted at Pi
7.    T_Pj = subtree of T rooted at Pj
8.    tp = tapping point on (Pi,Pj)   // point that minimizes the skew of
                                      // the tree T_tp = T_Pi ∪ T_Pj ∪ (Pi,Pj)
9.    ADD(S',tp)                      // add tp to S'
10.   ADD(T,(Pi,Pj))                  // add matching segment (Pi,Pj) to T
11. if (|S| % 2 == 1)                 // if |S| is odd, add unmatched node
12.   ADD(S',unmatched node)
13. RGM(S',T)                         // recursively call RGM

Fig. 7.12 Illustration of the recursive geometric matching (RGM) algorithm. (a) Set of n sinks S.
(b) Min-cost geometric matching on n sinks. (c) Find balance or tapping points (gray 'o') for
each of the n / 2 line segments. Notice that each tapping point is not necessarily the midpoint of the
matching segment; it is the point that achieves zero skew in the subtree. (d) Min-cost geometric
matching on n / 2 tapping points. (e) Final result after recursively performing RGM on each
newly generated set of tapping points.

In practice, RGM improves clock tree balance and wirelength over MMM. Fig. 7.13
compares the results from (a) MMM and (b) RGM on a set of four sinks. For this
instance, when MMM makes an unfortunate choice of cut direction (median 
computation), RGM can reduce wirelength by a factor of two. However, as with the 
MMM algorithm, RGM does not guarantee zero skew. In particular, if two subtrees 
have very different source-sink delays and their roots are matched, then it may not 
be possible to find a zero-skew tapping point on the matching segment. 







Fig. 7.13 Wirelength comparison between MMM (left) and RGM (right) on four sinks. 

Exact Zero Skew. The exact zero skew algorithm, proposed in 1991 [7.17], adopts 
a bottom-up process of matching subtree roots and merging the corresponding 
subtrees, similar to RGM. However, this algorithm features two important 
improvements. First, it finds exact zero-skew tapping points with respect to the 
Elmore delay model rather than the linear delay model. This makes the result more 
useful, with smaller actual clock skew in practice. Second, it maintains exact delay 
balance even when two subtrees with very different source-sink delays are matched. 
This is accomplished by elongating wires as necessary to equalize source-sink 
delays. 



When the roots of two subtrees are matched and the subtrees are merged, a 
zero-skew merging (tapping) point is determined (Fig. 7.14). In the figure, tp 
indicates the position of the zero-skew tapping point along the matching segment of 
length $L(s_1,s_2)$ between nodes $s_1$ and $s_2$.

Fig. 7.14 Finding a zero-skew tapping point when two subtrees are merged. The tapping point tp,
located on the segment connecting both subtrees, is where the Elmore delay to sinks is equalized.

To achieve zero skew, the delays from the tapping point tp to the sinks of $T_{s_1}$, and
from the tapping point tp to the sinks of $T_{s_2}$, must be the same. Thus,

$$t_{ED}(tp) = R(w_1) \cdot \left( \frac{C(w_1)}{2} + C(s_1) \right) + t_{ED}(T_{s_1}) = R(w_2) \cdot \left( \frac{C(w_2)}{2} + C(s_2) \right) + t_{ED}(T_{s_2})$$

where $w_1$ is the wire segment connecting $s_1$ to tp, $w_2$ is the wire segment connecting
tp to $s_2$, $R(w_1)$ and $C(w_1)$ denote the respective resistance and capacitance of $w_1$,
$R(w_2)$ and $C(w_2)$ denote the respective resistance and capacitance of $w_2$, $C(s_1)$ is the
capacitance of $s_1$, $C(s_2)$ is the capacitance of $s_2$, $t_{ED}(T_{s_1})$ is the Elmore delay of
subtree $T_{s_1}$, and $t_{ED}(T_{s_2})$ is the Elmore delay of subtree $T_{s_2}$.

The wire resistance and capacitance of $w_1$ are found by multiplying the respective
unit resistance α and capacitance β by the distance from $s_1$ to tp.

$R(w_1) = \alpha \cdot z \cdot L(s_1,s_2)$ and $C(w_1) = \beta \cdot z \cdot L(s_1,s_2)$

Similarly, the wire resistance and capacitance of $w_2$ are found by multiplying the
respective unit resistance α and capacitance β by the distance from tp to $s_2$.

$R(w_2) = \alpha \cdot (1-z) \cdot L(s_1,s_2)$ and $C(w_2) = \beta \cdot (1-z) \cdot L(s_1,s_2)$

Combining the above equations yields the position of the zero-skew tapping point

$$z = \frac{\left( t_{ED}(T_{s_2}) - t_{ED}(T_{s_1}) \right) + \alpha \cdot L(s_1,s_2) \cdot \left( C(s_2) + \frac{\beta \cdot L(s_1,s_2)}{2} \right)}{\alpha \cdot L(s_1,s_2) \cdot \left( \beta \cdot L(s_1,s_2) + C(s_1) + C(s_2) \right)}$$

When $0 \le z \le 1$, the tapping point is located along the segment connecting the roots
of the two subtrees. Otherwise, the wire must be elongated to meet the zero-skew
condition.
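
The closed-form tapping position can be checked numerically. In this minimal sketch, the function and parameter names are assumptions; `t1`, `t2` stand for the Elmore delays of the two subtrees, `c1`, `c2` for their capacitances, and `alpha`, `beta` for the unit wire resistance and capacitance.

```python
def tapping_point(t1, t2, c1, c2, length, alpha, beta):
    """Fraction z of the segment, measured from s1, where both delays match."""
    num = (t2 - t1) + alpha * length * (c2 + beta * length / 2)
    den = alpha * length * (beta * length + c1 + c2)
    return num / den

def branch_delay(t_sub, c_sub, wire_len, alpha, beta):
    """Elmore delay through a wire of length wire_len into a subtree."""
    r, c = alpha * wire_len, beta * wire_len
    return r * (c / 2 + c_sub) + t_sub

# Hypothetical values: segment length 10, unit R = 0.1, unit C = 0.2.
z = tapping_point(t1=0.0, t2=1.0, c1=1.0, c2=3.0, length=10.0, alpha=0.1, beta=0.2)
left = branch_delay(0.0, 1.0, z * 10.0, 0.1, 0.2)
right = branch_delay(1.0, 3.0, (1 - z) * 10.0, 0.1, 0.2)
# left and right agree, confirming zero skew at the tapping point.
```

With these numbers, z = 5/6, so the tapping point sits closer to the slower, more heavily loaded subtree. Raising `t2` to 5.0 gives z = 1.5, which falls outside the segment and corresponds to the wire-elongation case.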
Deferred-Merge Embedding (DME). The deferred-merge embedding (DME)
algorithm defers the choice of merging (tapping) points for subtrees of the clock
tree. DME optimally embeds any given topology over the sink set S: the embedding
has minimum possible source-sink linear delay, and minimum possible total tree
cost. While the preceding methods MMM and RGM each require only a set of sink
locations as input, variants of DME need a tree topology as input. The algorithm
was independently proposed by several groups - Boese and Kahng [7.3], Chao et al.
[7.4], and Edahiro [7.6].

A fundamental weakness of the preceding algorithms is that they decide the
locations of internal nodes of the clock tree very early - before intelligent decisions
are even possible. Once a centroid is determined in MMM, or a tapping point in
RGM or an exact zero skew tree, it is never changed. Yet, in the Manhattan
geometry, two sinks in general position will have an infinite number of midpoints,
creating a tilted line segment, or Manhattan arc (Fig. 7.15); each of these midpoints
affords the same minimum wirelength and exact zero skew. Ideally, the selection of
embedding points for internal nodes will be delayed for as long as possible.

Fig. 7.15 The locus of all midpoints between two sinks $s_1$ and $s_2$ is a Manhattan arc in the
Manhattan geometry. On the other hand, the midpoint is unique in Euclidean geometry. (a) Sinks $s_1$
and $s_2$ are neither horizontally nor vertically aligned. Therefore, the Manhattan arc has non-zero
length. (b) Sinks $s_1$ and $s_2$ are horizontally (left) and vertically (right) aligned. Therefore, the
Manhattan arc for both cases has zero length.

The DME algorithm embeds internal nodes of the given topology G via a two-phase 
process. The first phase of DME is bottom-up, and determines all possible locations 
of internal nodes of G that are consistent with a minimum-cost ZST T. The output of 
the first phase is a "tree of line segments", with each line segment being the locus of 
possible placements of an internal node of T. The second phase of DME is
top-down, and chooses the exact locations of all internal nodes in T. The output of
the second phase is a fully embedded, minimum-cost ZST with topology G. 



In the following, let S denote the set of sinks $\{s_1, s_2, \ldots, s_n\}$, and let $s_0$ denote the
clock source. Let pl(v) denote the location of a node v in the output clock tree T.
Several terms are specific to the Manhattan geometry. As already noted, a
Manhattan arc is a tilted line segment with slope ±1. If the two sinks of Fig. 7.15 are
horizontally or vertically aligned, there is only one possible midpoint, i.e., a
zero-length Manhattan arc.

Fig. 7.16 (a) Sinks $s_1$ and $s_2$ form a Manhattan arc. (b) An example of a tilted rectangular region
(TRR) for the Manhattan arc of $s_1$ and $s_2$ with radius of two units.

A tilted rectangular region (TRR) is a collection of points within a fixed distance of 
a Manhattan arc (Fig. 7.16). The core of a TRR is the subset of its points at 
maximum distance from its boundary, and the radius of a TRR is the distance 
between its core and boundary. 

The merging segment of a node v in the topology, denoted by ms(v), is the locus of
feasible locations for v, consistent with exact zero skew and minimum wirelength
(Fig. 7.17). The following presents the sub-algorithms used for DME.

Fig. 7.17 A bottom-up construction of the merging segment $ms(u_3)$ for node $u_3$, the parent of nodes
$u_1$ and $u_2$, given the topology on the left. The sinks $s_1$ and $s_2$ form the merging segment $ms(u_1)$, and
the sinks $s_3$ and $s_4$ form the merging segment $ms(u_2)$. The two segments $ms(u_1)$ and $ms(u_2)$ together
form the merging segment $ms(u_3)$.
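
One convenient way to compute Manhattan arcs and TRR intersections is to rotate the plane by 45 degrees. This is a standard geometric trick and an assumption of this sketch, not a step stated in the text. With u = x + y and v = x − y, Manhattan distance becomes the larger of |Δu| and |Δv|, so every TRR becomes an axis-parallel rectangle. The sketch below computes the merging segment of two sinks under the linear delay model; all function names are hypothetical.

```python
def to_uv(p):
    """Rotate (x, y) so that Manhattan distance becomes max(|du|, |dv|)."""
    x, y = p
    return (x + y, x - y)

def to_xy(q):
    """Inverse of to_uv."""
    u, v = q
    return ((u + v) / 2, (u - v) / 2)

def trr(core, radius):
    """Expand a core rectangle (u_lo, u_hi, v_lo, v_hi) by radius."""
    u_lo, u_hi, v_lo, v_hi = core
    return (u_lo - radius, u_hi + radius, v_lo - radius, v_hi + radius)

def intersect(a, b):
    """Intersection of two rectangles in (u, v) space, or None if empty."""
    u_lo, u_hi = max(a[0], b[0]), min(a[1], b[1])
    v_lo, v_hi = max(a[2], b[2]), min(a[3], b[3])
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)

def merge_sinks(s1, s2):
    """Endpoints (x, y) of the merging segment of two sinks, linear delay."""
    (u1, v1), (u2, v2) = to_uv(s1), to_uv(s2)
    radius = max(abs(u1 - u2), abs(v1 - v2)) / 2   # half the Manhattan distance
    seg = intersect(trr((u1, u1, v1, v1), radius),
                    trr((u2, u2, v2, v2), radius))
    return to_xy((seg[0], seg[2])), to_xy((seg[1], seg[3]))
```

For sinks at (0,0) and (4,2), the merging segment runs from (1,2) to (3,0); both endpoints are at Manhattan distance 3 from each sink. For the horizontally aligned sinks (0,0) and (4,0), the segment collapses to the single point (2,0).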



The bottom-up phase of DME (Build Tree of Segments algorithm) starts with all
sink locations S given. Each sink location is viewed as a (zero-length) Manhattan arc
(lines 2-3). If two sinks have the same parent node u, then the locus of possible
placements of u is a merging segment (Manhattan arc) ms(u). In general, given the
Manhattan arcs that are the merging segments of two nodes a and b, the merging
segment of their parent node is uniquely determined due to the minimum-cost
property, and is itself another Manhattan arc (Fig. 7.17). The edge lengths $|e_a|$ and
$|e_b|$ are uniquely determined by the minimum-length and zero-skew requirements
(lines 5-11). As a result, the entire tree of merging segments can be constructed
bottom-up in linear time (Fig. 7. 18). 
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Build Tree of Segments Algorithm (DME Bottom-Up Phase) 

Input: set of sinks and tree topology G{S,Top) 

Output: merging segments ms{v) and edge lengths \ev\,veG 



1. 

2. 

3. 

4. 

5. 

6. 

7. 

8. 

9. 

10. 

11. 



foreach (node v e G,\r\ bottom-up order) 



if (i/is a sink node) 

ms[v] = PL{v) 
else 

(a,ft) = CHILDREN(i/) 

CALC_EDGE_LENGTH(ea,eb) 

trr[a][core] = MS{a) 

trr[a][raclius] = \ea\ 

trr[b][core] = MS{b) 

trr[b][radius] = |et,| 

mslv] = trr[a] n trr[b] 



II if 1/ is a terminal, then ms{v) is a 

// zero-length IVianhattan arc 

// otherwise, if v is an internal node, 

// find i/s children and 

// calculate the edge length 

// create trr(a) - find merging segment 

// and radius of a 

// create trr(b) - find merging segment 

// and radius of £) 

// merging segment of v 
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Fig. 7.18 Construction of a tree of merging segments (DME bottom-up pliase). Solid lines are 
merging segments, dotted rectangles are tlie tilted rectangular regions (TRR), and dashed lines 
are edges between merging segments; .So is the clock source, and S\-st, are sinks, (a) The eight 
sinks and the clock source, (b) Construct merging segments for the eight sinks, (c) Construct 
merging segments for the segments generated in (a), (d) Construct the root segment, the merge 
segment that connects to the clock source. 
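
The edge-length step of the bottom-up phase (line 6) can be illustrated with a small sketch. It is not the book's CALC_EDGE_LENGTH: the function name, its arguments, and the use of the linear delay model (wire delay equal to wirelength) are assumptions made here for illustration. Under the Elmore model the balance point would come from a different equation.

```python
def calc_edge_lengths(dist, t_a, t_b):
    """Split a merge of children a and b so both see equal delay.

    dist:     Manhattan distance between ms(a) and ms(b)
    t_a, t_b: delay from a (resp. b) down to its sinks
    Assumption: linear delay model, i.e. wire delay equals wire length.
    Returns the pair (|e_a|, |e_b|).
    """
    # Solve t_a + e_a == t_b + (dist - e_a) for e_a.
    e_a = (dist + t_b - t_a) / 2.0
    if e_a < 0.0:
        # Subtree a is already too slow: merge at a and detour the wire to b.
        return 0.0, t_a - t_b
    if e_a > dist:
        # Subtree b is already too slow: merge at b and detour the wire to a.
        return t_b - t_a, 0.0
    return e_a, dist - e_a
```

When the balance point falls between the two children, the edge lengths sum to dist, so no wirelength is wasted. Otherwise one edge is made longer than dist, which corresponds to a wire detour.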
Find Exact Locations (DME Top-Down Phase)

Input: set of sinks S, tree topology G, outputs of DME bottom-up phase

Output: minimum-cost zero-skew tree T with topology G

1.  foreach (non-sink node v ∈ G in top-down order)
2.    if (v is the root)
3.      loc = any point in ms(v)
4.    else
5.      par = PARENT(v)                   // par is the parent of v
6.      trr[par][core] = PL(par)          // create trr(par) - find merging segment
7.      trr[par][radius] = |e_v|          // and radius of par
8.      loc = any point in ms[v] ∩ trr[par]
9.    pl[v] = loc



In the DME top-down phase (Find Exact Locations algorithm), exact locations of
internal nodes in G are determined, starting with the root. Any point on the root
merging segment from the bottom-up phase is consistent with a minimum-cost ZST
(lines 2-3). Given that the location of a parent node par has already been chosen in
the top-down processing, the location of its child node v is determined from two
known quantities: (1) |e_v|, the edge length from v to its parent par, and (2) ms(v), the
locus of placements for v consistent with a minimum-cost ZST (lines 6-8). The
location of v, i.e., pl(v), can be determined as illustrated in Fig. 7.19.



[Figure: the possible locations of v are the intersection of trr(par), centered at pl(par), with ms(v).]

Fig. 7.19 Finding the location of child node v given the location of its parent node par.



Thus, the embeddings of all internal nodes of the topology can be determined, 
top-down, in linear time (Fig. 7.20). 
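
Lines 5-8 of the top-down phase can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: a merging segment is stored as its two endpoints, trr(par) is treated as the set of points within Manhattan distance |e_v| of pl(par), and the helper names manhattan and place_child are hypothetical. Among all valid choices, the sketch returns the point of ms(v) closest to the parent.

```python
def manhattan(p, q):
    """Manhattan distance between points p and q."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def place_child(ms_v, pl_par, e_v, eps=1e-9):
    """Return a point of ms(v) within Manhattan distance e_v of pl(par).

    ms_v is ((x0, y0), (x1, y1)), the endpoints of a Manhattan arc;
    a zero-length arc (both endpoints equal) is allowed.
    """
    (x0, y0), (x1, y1) = ms_v
    # Along the arc, the distance to pl_par is convex and piecewise linear,
    # so its minimum lies at an endpoint or where x or y matches pl_par.
    params = [0.0, 1.0]
    if x1 != x0:
        params.append((pl_par[0] - x0) / (x1 - x0))
    if y1 != y0:
        params.append((pl_par[1] - y0) / (y1 - y0))
    candidates = [(x0 + t * (x1 - x0), y0 + t * (y1 - y0))
                  for t in params if 0.0 <= t <= 1.0]
    best = min(candidates, key=lambda p: manhattan(p, pl_par))
    if manhattan(best, pl_par) > e_v + eps:
        raise ValueError("ms(v) and trr(par) do not intersect")
    return best
```

If the closest point is nearer than |e_v|, the remaining wirelength has to be realized as a detour so that the edge length fixed in the bottom-up phase is preserved.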



The DME algorithm can also be applied to the BST (bounded-skew tree) problem
[7.5], given a straightforward generalization from merging segments to merging
regions. Each merging region in the bounded-skew DME construction is bounded
by at most six segments having slopes +1, 0, -1, or +∞ (one of four possibilities). A
related exercise is given at the end of this chapter.

Fig. 7.20 Embedding the clock tree during the DME top-down phase. Gray lines indicate
merging segments, dotted lines show connections between merging segments, and black lines
indicate routing segments. (a) Connecting the clock source to the root merging segment. (b)
Connecting the root merging segment to its children merging segments. (c) Connecting those
merging segments to their children. (d) Connecting those merging segments to the sinks.



7.5.2 Clock Tree Buffering in the Presence of Variation



High-performance designs require low clock skew, while their clock networks
contribute significantly to power consumption. Therefore, tradeoffs between skew
and total capacitance are important. Skew optimization requires highly accurate
timing analysis. Simulation-based timing analysis is often too time-consuming,
while closed-form delay models are too inaccurate. A commonly used compromise
is to first perform optimization based on the Elmore delay model and then tune the
tree with more accurate optimizations. For example, a difference of 5 picoseconds
(ps) is only a 1% error relative to a 500 ps sink latency, but is a 50% error relative to
a 10 ps skew in the same tree. To address challenging skew constraints, a clock tree
undergoes several optimization steps, including (1) geometric clock tree
construction (Sec. 7.5.1), (2) initial clock buffer insertion, (3) clock buffer sizing,
(4) wire sizing, and (5) wire snaking. In the presence of process, voltage, and
temperature (PVT) variations, such optimizations require accurate models of the
impacts of these variations.

High-level skew optimization. In the 1980s, the entire clock tree could be driven by
a single buffer. However, with technology scaling, wires have become more
resistive and clock trees today can no longer be driven by a single buffer. Therefore,
clock buffers are inserted at multiple locations in the tree to ensure that the clock
signal has sufficient strength to propagate to all sinks (timing points). The locations
and sizes of buffers are used to control the propagation delay within each branch of
the tree. Though not intended for clock trees, the algorithm proposed by L. van
Ginneken [7.8] optimally buffers a tree to minimize Elmore delay between the
source and every sink in O(n^2) time, where n is the total number of possible buffer
locations. The O(n log n)-time variant proposed in [7.16] is more scalable. These
algorithms also avoid insertion of unnecessary buffers on fast paths, thus achieving
lower skew if the initial tree was balanced. After initial buffer insertion, subsequent
optimizations are performed to minimize skew, decrease power consumption, and
improve robustness of the clock tree to unforeseen changes to buffer characteristics,
e.g., manufacturing process variations.

Clock buffer sizing. The choice of the clock buffer size for initial buffer insertion
affects downstream optimizations, as most of the buffers' sizes and locations are
unlikely to change. However, the best-performing size is difficult to identify
analytically. Therefore, it is often determined by trial and error, e.g., using binary
search. Clock buffer sizes can be further adjusted as follows. For a pair of sinks s1
and s2 with significant skew, find the unique path π in the tree connecting s1 and s2.
Upsize the buffers on π according to a pre-computed table (discussed below) that
matches an appropriate buffer size to each path length and fanout. In practice, larger
buffers improve the robustness of the circuit, but consume more power and may
introduce additional delay.

Wire sizing. The choice of wire width affects both power and susceptibility to 
manufacturing defects. Wider wires are more resilient to variation, but have greater 
capacitance and consume more dynamic power than thinner wires. Wider wires (and 
wider spacings to neighbors) are preferred for high-performance designs; thinner 
wires are preferred for low-power or less aggressive designs. After the initial wire 
width is chosen, it can be adjusted for individual segments based on timing analysis. 

Low-level skew optimization. Compared to the global impact of high-level skew 
optimizations, low-level skew optimizations cause smaller, localized changes. The 
precision of low-level skew optimizations is typically much greater than that of 
high-level skew optimizations. Low-level optimizations, such as wire sizing and 
wire snaking, are preferred for fine-tuning skew. To slow down fast sinks, the length 
of the path can be increased by purposely detouring the wire. This increases the total 
capacitance and resistance of the path, thus increasing the propagation delay. 
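
The effect of wire snaking on delay can be estimated with the Elmore model. The sketch below assumes a single uniform wire with unit-length resistance r and capacitance c that drives a lumped sink capacitance c_load, so that its Elmore delay is r·L·(c·L/2 + c_load). The function names and this simplified setting are assumptions for illustration, not a procedure from the text.

```python
import math

def wire_elmore_delay(length, r, c, c_load):
    """Elmore delay of a uniform wire of the given length driving c_load."""
    return r * length * (c * length / 2.0 + c_load)

def length_for_delay(target, r, c, c_load):
    """Wire length whose Elmore delay equals target.

    Positive root of (r*c/2)*L^2 + (r*c_load)*L - target = 0.
    """
    rc_load = r * c_load
    return (-rc_load + math.sqrt(rc_load ** 2 + 2.0 * r * c * target)) / (r * c)

def snaking_detour(length, extra_delay, r, c, c_load):
    """Extra wirelength needed to slow the sink down by extra_delay."""
    target = wire_elmore_delay(length, r, c, c_load) + extra_delay
    return length_for_delay(target, r, c, c_load) - length
```

Because the delay grows quadratically with length, each additional unit of detour adds more delay than the previous one, which is why snaking is suited to small, precise corrections.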

Variation modeling. Due to randomness in the semiconductor manufacturing
process, every transistor in every chip is slightly different. In addition, every chip
can be used at different ambient temperature, and will locally heat up or cool down
depending on activity patterns. Supply voltage may also change depending on
manufacturing variation and power drawn by other parts of the chip. Nevertheless,
modern clock trees must operate as expected under a variety of circumstances. To
ensure such robustness, an efficient and accurate variation model encapsulates the
different parameters, e.g., wire width and thickness, of each library element as
well-defined random variables. However, predicting the impact of process variations
is difficult. One option is to run a large number of individual simulations with
different parameter settings (Monte-Carlo simulation), but this is slow and
impractical in an optimization flow.

A second option is to generate a lookup table that captures worst-case skew 
variations between pairs of sinks based on (1) technology node, (2) clock buffer and 
wire library, (3) tree path length, (4) variation model, and (5) desired yield. Though 
creating this table requires extensive simulations, this only needs to be done once for 
a given technology. The resulting table can be used for any compatible clock tree 
optimization, e.g., for clock buffer sizing, as previously explained in this section. In 
general, this lookup table approach facilitates a fast and accurate optimization. 

Further clock network design techniques are discussed in Chaps. 42-43 of [7.1],
including active deskewing and clock meshes, common in modern CPUs, as well as
clock gating, used to decrease clock power dissipation. The book [7.18] focuses on
clocking in modern VLSI systems from a designer perspective and recommends a
number of techniques to minimize the impact of process variations. The book [7.15]
discusses clocking for high-performance and low-power applications.

Exercise 1: Net Ordering

Recall from Chap. 5 that negotiated-congestion routing (NCR) allows for temporary 
routing violations and maintains history costs for every routing segment. These 
history costs can decrease the significance of net ordering. Given that NCR can be 
adapted to track-based detailed routing, give several reasons why net ordering 
remains important in area routing. 

Exercise 2: Octilinear Maze Search

Compare octilinear maze search (Sec. 7.3.2) to breadth-first search (BFS) and 
Dijkstra's algorithm in terms of input data, output, and runtime. Can BFS or 
Dijkstra's algorithm be used instead of octilinear maze search? 

Exercise 3: Method of Means and Medians (MMM)

Given the following locations of a clock source s0 and eight sinks s1-s8:

(a) Draw the clock tree generated by MMM.

(b) Draw the clock tree topology generated by MMM.

(c) Calculate the total wirelength (in terms of grid units) and the clock skew using
the linear delay model.

s0 (7,1)
s1 (2,2)
s2 (4,2)
s3 (2,12)
s4 (4,12)
s5 (8,8)
s6 (8,6)
s7 (14,8)
s8 (14,6)



Exercise 4: Recursive Geometric Matching (RGM) Algorithm

Consider the same locations of a clock source s0 and eight sinks s1-s8 as in Exercise 3.

(a) Draw the clock tree generated by RGM. Use the linear delay model when
constructing tapping points. Let each sink have no delay, i.e., t_LD(s_i) = 0, where
1 ≤ i ≤ 8.

(b) Draw the clock-tree topology generated by RGM.

(c) Calculate total wirelength (in terms of grid units) and clock skew using the
linear delay model.

Exercise 5: Deferred-Merge Embedding (DME) Algorithm

Given the locations of a source s0 and four sinks s1-s4 (left) and the following
connection topology (middle), execute the bottom-up and top-down phases of DME,
as described in Sec. 7.5.1.

(a) Bottom-up phase: draw the merging segments for the internal nodes, ms(u1),
ms(u2) and ms(u3).

(b) Top-down phase: draw a minimum-cost embedded clock tree from s0 to all four
sinks s1-s4.

s0 (7,1)
s1 (2,9)
s2 (10,11)
s3 (4,1)
s4 (14,7)



Exercise 6: Exact Zero Skew

Using the clock-tree topology and generated clock tree from Exercise 5:

(a) Determine the exact locations of internal nodes u1, u2 and u3 to have zero skew.
Let the unit length resistance and capacitance be α = 0.1 and β = 0.01,
respectively. Use the Elmore delay model with the following sink information.

C(s1) = 0.1    t_ED(s1) =
C(s2) = 0.2    t_ED(s2) =
C(s3) = 0.3    t_ED(s3) =
C(s4) = 0.2    t_ED(s4) =

(b) Calculate the Elmore delay from u1 to each sink s1-s4 in the zero-skew tree
constructed in (a).

Exercise 7: Bounded-Skew DME

Given the locations of a source s0 and four sinks s1-s4 (left), and using the following
topology (center), calculate the bounded-skew feasible region for internal nodes u1,
u2 and u3, using the linear delay model. The skew bound is four grid edges.

s0 (6,5)
s1 (1,2)
s2 (4,5)
s3 (9,4)
s4 (12,1)

8 Timing Closure



The layout of an integrated circuit (IC) must not only satisfy geometric requirements, 
e.g., non-overlapping cells and routability, but also meet the design's timing 
constraints, e.g., setup (long-path) and hold (short-path) constraints. The 
optimization process that meets these requirements and constraints is often called 
timing closure. It integrates point optimizations discussed in previous chapters, such 
as placement (Chap. 4) and routing (Chaps. 5-7), with specialized methods to 
improve circuit performance. The following components of timing closure are 
covered in this chapter. 

– Timing-driven placement (Sec. 8.3) minimizes signal delays when assigning
locations to circuit elements.

– Timing-driven routing (Sec. 8.4) minimizes signal delays when selecting
routing topologies and specific routes.

– Physical synthesis (Sec. 8.5) improves timing by changing the netlist.

    – Sizing transistors or gates: increasing the width:length ratio of transistors
    to decrease the delay or increase the drive strength of a gate (Sec. 8.5.1).

    – Inserting buffers into nets to decrease propagation delays (Sec. 8.5.2).

    – Restructuring the circuit along its critical paths (Sec. 8.5.3).

Sec. 8.6 integrates these optimizations in a performance-driven physical design flow. 

8.1 Introduction



For many years, signal propagation delay in logic gates was the main contributor to 
circuit delay, while wire delay was negligible. Therefore, cell placement and wire 
routing did not noticeably affect circuit performance. Starting in the mid-1990s, 
technology scaling significantly increased the relative impact of wiring-induced 
delays, making high-quality placement and routing critical for timing closure. 

Timing optimization engines must estimate circuit delays quickly and accurately to 
improve circuit timing. Timing optimizers adjust propagation delays through circuit 
components, with the primary goal of satisfying timing constraints, including 

– Setup (long-path) constraints, which specify the amount of time a data input
signal should be stable (steady) before the clock edge for each storage element
(e.g., flip-flop or latch).

– Hold-time (short-path) constraints, which specify the amount of time a data
input signal should be stable after the clock edge at each storage element.



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6_8, © Springer Science+Business Media B.V. 2011 
Setup constraints ensure that no signal transition occurs too late. Initial phases of
timing closure focus on these types of constraints, which are formulated as follows.

$t_{cycle} \geq t_{combDelay} + t_{setup} + t_{skew}$

Here, t_cycle is the clock period, t_combDelay is the longest path delay through
combinational logic, t_setup is the setup time of the receiving storage element (e.g.,
flip-flop), and t_skew is the clock skew (Sec. 7.4). Checking whether a circuit meets
setup constraints requires estimating how long signal transitions will take to
propagate from one storage element to the next. Such delay estimation is typically
based on static timing analysis (STA), which propagates actual arrival times (AATs)
and required arrival times (RATs) to the pins of every gate or cell. STA quickly
identifies timing violations, and diagnoses them by tracing out critical paths in the
circuit that are responsible for these timing failures (Sec. 8.2.1).

Motivated by efficiency considerations, STA does not consider circuit functionality
and specific signal transitions. Instead, STA assumes that every cell propagates
every 0-1 (1-0) transition from its input(s) to its output, and that every such
propagation occurs with the worst possible delay¹. Therefore, STA results are often
pessimistic for large circuits. This pessimism is generally acceptable during
optimization because it affects competing layouts equally, without biasing the
optimization toward a particular layout. It is also possible to evaluate the timing of
several competing layouts with more accurate techniques in order to choose the best
solution.

One approach to mitigate pessimism in STA is to analyze the most critical paths. 
Some of these can be false paths - those that cannot be sensitized by any input 
transition because of the logic functions implemented by the gates or cells. IC 
designers often enumerate false paths that are likely to become timing-critical to 
exclude them from STA results and ignore them during timing optimization. 

STA results are used to estimate how important each cell and each net are in a
particular layout. A key metric for a given timing point g - that is, a pin of a gate or
cell - is timing slack, the difference between g's RAT and AAT: slack(g) = RAT(g)
- AAT(g). Positive slack indicates that timing is met - the signal arrives before it is
required - while negative slack indicates that timing is violated - the signal arrives
after its required time. Algorithms for timing-driven layout guide the placement and
routing processes according to timing slack values.

Guided by slack values, physical synthesis restructures the netlist to make it more 
suitable for high-performance layout implementation. For instance, given an 
unbalanced tree of gates, (1) the gates that lie on many critical paths can be upsized 
to propagate signals faster, (2) buffers may be inserted into long critical wires, and 
(3) the tree can be restructured to decrease its overall depth. 



¹ Path-based approaches for timing optimizations are discussed in Secs. 8.3-8.4.

Hold-time constraints ensure that signal transitions do not occur too early. Hold
violations can occur when a signal path is too short, allowing a receiving flip-flop to
capture the signal at the current cycle instead of the next cycle. The hold-time
constraint is formulated as follows.

$t_{combDelay} \geq t_{hold} + t_{skew}$

Here, t_combDelay is the delay of the circuit's combinational logic, t_hold is the hold time
required for the receiving storage element, and t_skew is the clock skew. As clock skew
affects hold-time constraints significantly more than setup constraints, hold-time
constraints are typically enforced after synthesizing the clock network (Sec. 7.4).
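
The setup and hold constraints can be checked numerically as margins. The following sketch is only illustrative: the function names are hypothetical, t_skew is treated as a worst-case magnitude exactly as in the two inequalities, and the distinction between longest and shortest combinational delay (t_comb_max for setup, t_comb_min for hold) is made explicit here as an assumption about how the formulas are applied.

```python
def setup_margin(t_cycle, t_comb_max, t_setup, t_skew):
    """Margin of t_cycle >= t_combDelay + t_setup + t_skew (negative = violated)."""
    return t_cycle - (t_comb_max + t_setup + t_skew)

def hold_margin(t_comb_min, t_hold, t_skew):
    """Margin of t_combDelay >= t_hold + t_skew (negative = violated)."""
    return t_comb_min - (t_hold + t_skew)

# Hypothetical numbers in picoseconds, chosen only to exercise both checks.
setup_ok = setup_margin(t_cycle=1000, t_comb_max=850, t_setup=50, t_skew=30) >= 0
hold_ok = hold_margin(t_comb_min=40, t_hold=25, t_skew=30) >= 0
```

With these numbers the setup constraint is met with 70 ps to spare, while the hold constraint is violated by 15 ps. Lengthening the clock period helps only the first check, which is one reason hold violations are typically repaired by adding delay to short paths.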

Timing closure is the process of satisfying timing constraints through layout 
optimizations and netlist modifications. It is common to use verbal expressions such 
as "the design has closed timing" when the design satisfies all timing constraints. 

This chapter focuses on individual timing algorithms (Secs. 8.2-8.4) and
optimizations (Sec. 8.5), but in practice, these must be applied in a carefully
balanced design flow (Sec. 8.6). Timing closure may repeatedly invoke certain
optimization and analysis steps in a loop until no further improvement is observed.
In some cases, the choice of optimization steps depends on the success of previous
steps as well as the distribution of timing slacks computed by STA.

Almost all digital ICs are synchronous finite state machines (FSM), or sequential
machines. In FSMs, transitions occur at a set clock frequency. Fig. 8.1 "unrolls" a
sequential circuit in time, from one clock period to the next. The figure shows two
types of circuit components: (1) clocked storage elements, e.g., flip-flops or latches,
also referred to as sequential elements, and (2) combinational logic. During each
clock period in the operation of the sequential machine, (1) present-state bits stored
in the clocked storage elements flow from the storage elements' output pins, along
with system inputs, into the combinational logic, (2) the network of combinational
logic then generates values of next-state functions, along with system outputs, and
(3) the next-state bits flow into the clocked elements' data input pins, and are stored
at the next clock tick.




Fig. 8.1 A sequential circuit, consisting of flip-flops and combinational logic, "unrolled" in time.

The maximum clock frequency for a given design depends upon (1) gate delays,
which are the signal delays due to gate transitions, (2) wire delays, which are the
delays associated with signal propagation along wires, and (3) clock skew (Sec. 7.4).
In practice, the predominant sources of delay in standard signals come from gate and
wire delays. Therefore, when analyzing setup constraints, this section considers
clock skew negligible. A lower bound on the design's clock period is given by the
sum of gate and wire delays along any timing path through combinational logic -
from the output of a storage element to the input of the next storage element. This
lower bound on the clock period determines an upper bound on the clock frequency.

In earlier technologies, gate delays accounted for the majority of circuit delay, and 
the number of gates on a timing path provided a reasonable estimate of path delay. 
However, in recent technologies, wire delay, along with the component of gate delay 
that is dependent on capacitive loading, comprises a substantial portion of overall 
path delay. This adds complexity to the task of estimating path delays and, hence, 
achievable (maximum) clock frequency. 

For a chip to function correctly, path delay constraints (Sec. 8.3.2) must be satisfied
whenever a signal transition traverses a path through combinational logic. The most
critical verification task faced by the designer is to confirm that all path delay
constraints are satisfied. Doing this dynamically, i.e., using circuit simulation, is
infeasible for two reasons. First, it is computationally intractable to enumerate all
possible combinations of state and input variables that can cause a transition along,
i.e., sensitize, a given combinational logic path. Second, there can be an exponential
number of paths through the combinational logic. Consequently, design teams often
sign off on circuit timing statically, using a methodology that pessimistically assumes
all combinational logic paths can be sensitized. This framework for timing closure is
based on static timing analysis (STA) (Sec. 8.2.1), an efficient, linear-time
verification process that identifies critical paths.

After critical paths have been identified, delay budgeting² (Secs. 8.2.2 and 8.3.1) sets
upper bounds on the lengths or propagation delays for these paths, e.g., using the
zero-slack algorithm [8.19], which is covered in Sec. 8.2.2. Other delay budgeting
techniques are described in [8.29].

² This methodology is intended for layout of circuits directly represented by graphs rather than
circuits partitioned into high-level modules. However, this methodology can also be adapted to
assign budgets to entire modules instead of circuit elements.

8.2.1 Static Timing Analysis

In STA, a combinational logic network is represented as a directed acyclic graph
(DAG) (Sec. 1.7). Fig. 8.2 illustrates a network of four combinational logic gates x,
y, z and w, three inputs a, b and c, and one output f. The inputs are annotated with
times 0, 0 and 0.6 time units, respectively, at which signal transitions occur relative
to the start of the clock cycle. Fig. 8.2 also shows gate and wire delays, e.g., the gate
delay from the input to the output of inverter x is 1 unit, and the wire delay from
input b to the input of inverter x is 0.1 units. For modern designs, gate and wire
delays are typically on the order of picoseconds (ps).



[Figure: circuit with inputs a<0>, b<0>, c<0.6>; gates x (1), y (2), z (2), w (2); wire delays
a→y (0.15), b→x (0.1), x→y (0.1), x→z (0.3), c→z (0.1), y→w (0.2), z→w (0.25), w→f (0.2).]

Fig. 8.2 Three inputs a, b and c are annotated with times at which signal transitions occur in angular
brackets. Each edge and gate is annotated with its delay in parentheses.

Fig. 8.3 illustrates the corresponding DAG, which has one node for each input and 
output, as well as one node for each logic gate. For convenience, a source node is 
introduced with a directed edge to each input. Nodes corresponding to logic gates 
are labeled with the respective gate delays (e.g., node y has the label 2). Directed 
edges from the source to the inputs are labeled with transition times, and directed 
edges between gate nodes are labeled with wire delays. Because this DAG 
representation has one node per logic gate, it follows the gate node convention. The 
pin node convention, where the DAG has a node for each pin of each gate, is more 
detailed and will be used later in this section. 



Fig. 8.3 DAG representation using gate node convention of the circuit in Fig. 8.2.



Actual arrival time. In a circuit, the latest transition time at a given node v ∈ V,
measured from the beginning of the clock cycle, is the actual arrival time (AAT),
denoted as AAT(v). By convention, this is the arrival time at the output side of node
v. For example, in Fig. 8.3, AAT(x) = 1.1 because of the wire delay from input b
(0.1) and the gate delay of inverter x (1.0). For node y, although the signal transitions
on the path through a will arrive at time 2.15, the arrival time is dominated by
transitions along the path through x. Hence, AAT(y) = 3.2. Formally, the AAT of
node v is

$AAT(v) = \max_{u \in FI(v)} \left( AAT(u) + t(u,v) \right)$

where FI(v) is the set of all nodes from which there exists a directed edge to v, and
t(u,v) is the delay on the (u,v) edge. This recurrence enables all AAT values in the
DAG to be computed in O(|V| + |E|) time or O(|gates| + |edges|). This linear scaling
of runtime makes STA applicable to modern designs with hundreds of millions of
gates. Fig. 8.4 illustrates the AAT computation from the DAG in Fig. 8.3.



[Figure: the DAG of Fig. 8.3 annotated with AATs: a = 0, b = 0, c = 0.6, x = 1.1, y = 3.2,
z = 3.4, w = 5.65, f = 5.85.]

Fig. 8.4 Actual arrival times (AATs) of the DAG, denoted with 'A', of Fig. 8.3.

Although STA is pessimistic, i.e., the longest path in the DAG of a circuit might not
actually be sensitizable, the design team must satisfy all timing constraints. An
algorithm to find all longest paths from the source in a DAG was proposed by
Kirkpatrick in 1966 [8.18]. It uses a topological ordering of nodes - if there exists a
(u,v) edge in the DAG, then u is ordered before v. This ordering can be determined
in linear time by reversing the post-order labeling obtained by depth-first search.



Longest Paths Algorithm [8.18]

Input: directed graph G(V,E)

Output: AATs of all nodes v ∈ V based on worst-case (longest) paths

1.  foreach (node v ∈ V)
2.    AAT[v] = -∞                               // all AATs are by default unknown
3.  AAT[source] = 0                             // except source, which is 0
4.  Q = TOPO_SORT(V)                            // topological order
5.  while (Q != Ø)
6.    u = FIRST_ELEMENT(Q)                      // u is the first element in Q
7.    foreach (neighboring node v of u)
8.      AAT[v] = MAX(AAT[v],AAT[u] + t[u][v])   // t[u][v] is the (u,v) edge delay
9.    REMOVE(Q,u)                               // remove u from Q
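
The longest-paths computation can be tried on the example DAG of Fig. 8.3 with a short sketch. Several choices below are assumptions made for illustration rather than part of the algorithm as stated: the topological order comes from Python's standard graphlib module, the dictionaries WIRE and GATE encode the delays read from Fig. 8.2, and the edge delay t(u,v) is taken as the wire delay plus the gate delay of the receiving node v, so that each AAT refers to the output side of its node.

```python
from graphlib import TopologicalSorter

# Wire delays of Fig. 8.3; edges from the source s carry the input times.
WIRE = {("s", "a"): 0.0, ("s", "b"): 0.0, ("s", "c"): 0.6,
        ("a", "y"): 0.15, ("b", "x"): 0.1, ("x", "y"): 0.1,
        ("x", "z"): 0.3, ("c", "z"): 0.1, ("y", "w"): 0.2,
        ("z", "w"): 0.25, ("w", "f"): 0.2}
# Gate delays; inputs, the output f and the source have none.
GATE = {"x": 1.0, "y": 2.0, "z": 2.0, "w": 2.0}

def longest_paths(wire, gate_delay, source):
    """Compute AAT[v] for every node by relaxing edges in topological order."""
    preds, succs = {}, {}
    for (u, v) in wire:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
        succs.setdefault(u, []).append(v)
        succs.setdefault(v, [])
    aat = {v: float("-inf") for v in preds}   # all AATs unknown by default
    aat[source] = 0.0                         # except the source
    # static_order() yields every node after all of its predecessors.
    for u in TopologicalSorter(preds).static_order():
        for v in succs[u]:
            t_uv = wire[(u, v)] + gate_delay.get(v, 0.0)
            aat[v] = max(aat[v], aat[u] + t_uv)
    return aat

AAT = longest_paths(WIRE, GATE, "s")
```

Each edge is relaxed exactly once, which reflects the O(|V| + |E|) bound. Up to floating-point rounding, the resulting values agree with the AATs of the worked example, e.g., 1.1 at x and 3.2 at y.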

Required arrival time. The required arrival time (RAT), denoted as RAT(v), is the
time by which the latest transition at a given node v must occur in order for the
circuit to operate correctly within a given clock cycle. Unlike AATs, which are
determined from multiple paths from upstream inputs and flip-flop outputs, RATs
are determined from multiple paths to downstream outputs and flip-flop inputs. For
example, suppose that RAT(f) for the circuit in Fig. 8.2 is 5.5. This forces RAT(w) to
be 5.3, RAT(y) to be 3.1, and so on (Fig. 8.5). Formally, the RAT of a node v is

$RAT(v) = \min_{u \in FO(v)} \left( RAT(u) - t(v,u) \right)$

where FO(v) is the set of all nodes with a directed edge from v, and t(v,u) is the
delay on the (v,u) edge. The minimum is taken because v must meet the earliest
deadline imposed by any of its fanouts.

Fig. 8.5 Required arrival times (RATs) of the DAG, denoted with 'R', of Fig. 8.3.

Slack. Correct operation of the chip with respect to setup constraints, e.g.,
maximum path delay, requires that the AAT at each node does not exceed the RAT.
That is, for all nodes v ∈ V, AAT(v) ≤ RAT(v) must hold.

Fig. 8.6 The STA result of the DAG of Fig. 8.3, showing the actual (A) and required (R) arrival
times, and the slack (S) of each node.

The slack of a node v, defined as

$slack(v) = RAT(v) - AAT(v)$

is an indicator of whether the timing constraint for v has been satisfied. Critical
paths or critical nets are signals that have negative slack, while non-critical paths
or non-critical nets have positive slack.

Timing optimization (Sec. 8.5) is the process by which (1) negative slack is
increased to achieve design correctness, and (2) positive slack is reduced to
minimize overdesign and recover power and area. Fig. 8.6 illustrates the full STA
computation, including slack, from Fig. 8.3.
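
The backward pass for RATs and the resulting slacks can be sketched in the same style. The code is a minimal illustration under stated assumptions: each edge delay in T folds the wire delay together with the gate delay of the receiving node (e.g., 0.1 + 2 for the edge from x to y), the AATs are copied from Fig. 8.4, RAT(f) = 5.5 is the value assumed in the text, and the backward pass takes the minimum over fanouts because the tightest downstream requirement governs.

```python
from graphlib import TopologicalSorter

# Edge delays t(v,u): wire delay plus gate delay of the receiving node u.
T = {("a", "y"): 2.15, ("b", "x"): 1.1, ("x", "y"): 2.1, ("x", "z"): 2.3,
     ("c", "z"): 2.1, ("y", "w"): 2.2, ("z", "w"): 2.25, ("w", "f"): 0.2}
# AATs as annotated in Fig. 8.4.
AAT = {"a": 0.0, "b": 0.0, "c": 0.6, "x": 1.1, "y": 3.2, "z": 3.4,
       "w": 5.65, "f": 5.85}

def required_times(t, rat_at_outputs):
    """Propagate RATs from the outputs back toward the inputs."""
    preds, succs = {}, {}
    for (v, u) in t:
        preds.setdefault(u, set()).add(v)
        preds.setdefault(v, set())
        succs.setdefault(v, []).append(u)
        succs.setdefault(u, [])
    rat = {v: float("inf") for v in preds}
    rat.update(rat_at_outputs)
    order = list(TopologicalSorter(preds).static_order())
    for v in reversed(order):                 # outputs first, inputs last
        for u in succs[v]:
            rat[v] = min(rat[v], rat[u] - t[(v, u)])
    return rat

RAT = required_times(T, {"f": 5.5})
SLACK = {v: RAT[v] - AAT[v] for v in AAT}
CRITICAL = sorted(v for v in SLACK if SLACK[v] < 0)   # nodes with negative slack
```

In this example the nodes b, x, z, w and f share the worst slack of about -0.35, which traces the critical path from input b to output f. Node y is also violating, but only by about 0.1.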

A DAG labeled with the pin node convention facilitates a more detailed and
accurate timing analysis, as the delay of a gate output depends on which input pin
has switched. Fig. 8.7 shows the circuit of Fig. 8.2 annotated with the pin node
convention, where v_i is the i-th input pin of gate v, and v_O is the output pin of v.

Fig. 8.7 Circuit of Fig. 8.2 annotated with the pin node convention. For a logic gate v, v_i denotes its
i-th input pin, and v_O denotes its output pin.

Fig. 8.8 shows the result of STA constructed using the pin node convention.

Fig. 8.8 STA result for the circuit of Fig. 8.2, with a DAG constructed using the pin node
convention. Each node in the DAG is an input pin or an output pin of a logic gate.

Current practice. In modern designs, separate timing analyses are performed for
the cases of rise delay (rising transitions) and fall delay (falling transitions).

Signal integrity extensions to STA consider changes in delay due to switching
activity on neighboring wires of the path under analysis. For signal integrity
analysis, the STA engine keeps track of windows (intervals) of AATs and RATs,
and typically executes multiple timing analysis iterations before these timing
windows stabilize to a clean and accurate result.

Statistical STA (SSTA) is a generalization of STA where gate and wire delays are
modeled by random variables and represented by probability distributions [8.21].
Propagated AATs, RATs and timing slacks are also random variables. In this 
context, timing constraints can be satisfied with high probability (e.g., 95%). SSTA 
is an increasingly popular methodology choice for leading-edge designs, due to the 
increased manufacturing variability in advanced process nodes. Propagating 
statistical distributions instead of intervals avoids some of STA's inherent 
pessimism. This reduces the costly power, area and schedule impacts of overdesign. 

The static verification approach is continually challenged by two fundamental 
weaknesses - (1) the assumption of a clock and (2) the assumption that all paths are 
sensitizable. First, STA is not applicable in asynchronous contexts, which are 
increasingly prevalent in modern designs, e.g., asynchronous interfaces in 
systems-on-chips (SoCs), asynchronous logic design styles to improve speed and 
power. Second, optimization tools waste considerable runtime and chip resources - 
e.g., power, area and speed - satisfying "phantom" constraints. In practice, designers 
can manually or semi-automatically specify false and multicycle paths - paths 
whose signal transitions do not need to finish within one clock cycle. Methodologies 
to fully exploit the availability of such timing exceptions are still under development.

8.2.2 Delay Budgeting with the Zero-Slack Algorithm

In timing-driven physical design, both gate and wire delays must be optimized to 
obtain a timing-correct layout. However, there is a chicken-and-egg dilemma: (1) 
timing optimization requires knowledge of capacitive loads and, hence, actual 
wirelength, but (2) wirelengths are unknown until placement and routing are 
completed. To help resolve this dilemma, timing budgets are used to establish 
delay and wirelength constraints for each net, thereby guiding placement and 
routing to a timing-correct result. The best-known approach to timing budgeting is 
the zero-slack algorithm {ZSA) [8.19], which is widely used in practice. 

Algorithm. Consider a netlist consisting of logic gates $v_1, v_2, \ldots, v_n$ and nets
$e_1, e_2, \ldots, e_n$, where $e_i$ is the output net of gate $v_i$. Let $t(v)$ be the gate delay
of $v$, and let $t(e)$ be the wire delay of $e$. (A multi-fanout net $e_i$ has multiple
source-sink delays, so ZSA must be adjusted accordingly.) The ZSA takes the netlist
as input, and seeks to decrease positive slacks of all nodes to zero by increasing
$t(v)$ and $t(e)$ values. These increased delay values together constitute the timing
budget $TB(v)$ of node $v$, which should not be exceeded during placement and routing.

$$TB(v) = t(v) + t(e)$$

If TB(v) is exceeded, then the place-and-route tool typically (1) decreases the 
wirelength of e by replacement or rerouting, or (2) changes the size of gate v. The 
delay impact of a wire or gate size change can be estimated using the Elmore 
delay model [8.13]. If most arcs (branches) of a timing path are within budget, 
then the path may meet its timing constraints even if some arcs exceed their 
budgets. Thus, another approach to satisfying the timing budget is (3) rebudgeting. 
As in Sec. 8.2.1, let $AAT(v)$, $RAT(v)$ and $slack(v)$ denote respectively the AAT,
RAT, and slack at node $v$ in the timing graph $G$.

Zero-Slack Algorithm (Late-Mode Analysis)

Input: timing graph G(V,E)
Output: timing budgets TB for each v ∈ V

1.  do
2.    (AAT,RAT,slack) = STA(G)
3.    foreach (v_i ∈ V)
4.      TB[v_i] = DELAY(v_i) + DELAY(e_i)
5.    slack_min = ∞
6.    foreach (v ∈ V)
7.      if ((slack[v] < slack_min) and (slack[v] > 0))
8.        slack_min = slack[v]
9.        v_min = v
10.   if (slack_min ≠ ∞)
11.     path = v_min
12.     ADD_TO_FRONT(path,BACKWARD_PATH(v_min,G))
13.     ADD_TO_BACK(path,FORWARD_PATH(v_min,G))
14.     s = slack_min / |path|
15.     for (i = 1 to |path|)
16.       node = path[i]               // evenly distribute
17.       TB[node] = TB[node] + s      // slack along path
18. while (slack_min ≠ ∞)

The ZSA consists of three major steps. First, determine the initial slacks of all nodes
(lines 2-4), and select a node $v_{min}$ with minimum positive slack $slack_{min}$ (lines 5-9).
Second, find a path path of nodes that dominates $slack(v_{min})$, i.e., any change in
delays in path's nodes will cause $slack(v_{min})$ to change. This is done by calling the
two procedures BACKWARD_PATH and FORWARD_PATH (lines 12-13). Third,
evenly distribute the slack by increasing $TB(v)$ for each $v$ in path (lines 14-17). Each
budget increment $s$ will decrement a node slack $slack(v)$. By repeating the process
(lines 1-18), the slack of each node in $V$ will end up at zero. The resulting timing
budgets at all nodes are the final output of ZSA. Further details of ZSA, including
proofs of correctness and complexity analyses, are given in [8.19].

FORWARD_PATH($v_{min}$,G) constructs a path starting with node $v_{min}$, and iteratively
adds a node $v$ to the path from among the fanouts of the previously-added node in
path (lines 2-10). Each node $v$ satisfies the condition in line 6, where $RAT(v_{min})$ is
determined by $RAT(v)$, and $AAT(v)$ is determined by $AAT(v_{min})$, so that changing the
delay of either node affects the slack of both nodes.

Forward Path Search (FORWARD_PATH(v_min,G))
Input: node v_min with minimum slack slack_min, timing graph G
Output: maximal downstream path path from v_min such that no node v ∈ V affects
the slack of path

1.  path = v_min
2.  do
3.    flag = false
4.    node = LAST_ELEMENT(path)
5.    foreach (fanout node fo of node)
6.      if ((RAT[fo] == RAT[node] + TB[fo]) and (AAT[fo] == AAT[node] + TB[fo]))
7.        ADD_TO_BACK(path,fo)
8.        flag = true
9.        break
10. while (flag == true)
11. REMOVE_FIRST_ELEMENT(path)          // remove v_min

BACKWARD_PATH($v_{min}$,G) iteratively finds a fanin node of the current node so that
both nodes' slacks will change if either node's delay is changed.

Backward Path Search (BACKWARD_PATH(v_min,G))
Input: node v_min with minimum slack slack_min, timing graph G
Output: maximal upstream path path from v_min such that no node v ∈ V affects the
slack of path

1.  path = v_min
2.  do
3.    flag = false
4.    node = FIRST_ELEMENT(path)
5.    foreach (fanin node fi of node)
6.      if ((RAT[fi] == RAT[node] - TB[fi]) and (AAT[fi] == AAT[node] - TB[fi]))
7.        ADD_TO_FRONT(path,fi)
8.        flag = true
9.        break
10. while (flag == true)
11. REMOVE_LAST_ELEMENT(path)           // remove v_min
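The three procedures above can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the reference implementation of [8.19]: each node carries a single budget (gate plus output-net delay), the AAT of a node is the latest fanin AAT plus the node's own budget, and the RAT of a node is the smallest value of RAT minus budget over its fanouts. An edge is treated as "tied" when it is tight in both AAT and RAT under that convention. The names `sta`, `tied`, `zero_slack` and `rat_at_outputs` are hypothetical, and exact `Fraction` arithmetic is used so that the equality tests are reliable.

```python
from fractions import Fraction

def sta(order, fanin, fanout, tb, rat_at_outputs):
    """Late-mode STA over nodes given in topological order (assumed convention)."""
    aat, rat = {}, {}
    for v in order:
        aat[v] = max((aat[u] for u in fanin[v]), default=Fraction(0)) + tb[v]
    for v in reversed(order):
        if fanout[v]:
            rat[v] = min(rat[w] - tb[w] for w in fanout[v])
        else:
            rat[v] = Fraction(rat_at_outputs[v])
    slack = {v: rat[v] - aat[v] for v in order}
    return aat, rat, slack

def tied(u, w, aat, rat, tb):
    """True if edge u -> w is tight in both arrival and required times."""
    return aat[w] == aat[u] + tb[w] and rat[u] == rat[w] - tb[w]

def zero_slack(order, edges, delay, rat_at_outputs):
    """Return timing budgets that drive every positive slack to zero."""
    fanin = {v: [] for v in order}
    fanout = {v: [] for v in order}
    for u, w in edges:
        fanout[u].append(w)
        fanin[w].append(u)
    tb = {v: Fraction(delay[v]) for v in order}
    while True:
        aat, rat, slack = sta(order, fanin, fanout, tb, rat_at_outputs)
        positive = [v for v in order if slack[v] > 0]
        if not positive:
            return tb
        vmin = min(positive, key=slack.get)   # minimum positive slack
        path = [vmin]
        while True:                            # backward path search
            u = next((u for u in fanin[path[0]]
                      if tied(u, path[0], aat, rat, tb)), None)
            if u is None:
                break
            path.insert(0, u)
        while True:                            # forward path search
            w = next((w for w in fanout[path[-1]]
                      if tied(path[-1], w, aat, rat, tb)), None)
            if w is None:
                break
            path.append(w)
        s = slack[vmin] / len(path)            # evenly distribute the slack
        for node in path:
            tb[node] += s
```

For a three-gate chain with delays 1, 2, 3 and a required time of 12 at the output, every node starts with slack 6, the path covers all three gates, and the sketch returns budgets 3, 4 and 5. Unlike the pseudocode, the sketch keeps the budgets in `tb` across iterations instead of recomputing them from delay values.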

Early-mode analysis. ZSA uses late-mode analysis with respect to setup 
constraints, i.e., the latest times by which signal transitions can occur for the circuit 
to operate correctly. Correct operation also depends on satisfying hold-time 
constraints on the earliest signal transition times. Early-mode analysis considers 
these constraints. If the data input of a sequential element changes too soon after the 
triggering edge of the clock signal, the logic value at the output of that sequential 
element may become incorrect during the current clock cycle. As geometries shrink, 
early-mode violations have become an overriding concern. While setup violations 
can be avoided by lowering the chip's operating frequency, the chip's cycle time
does not affect hold-time constraints (Sec. 8.1). Violations of hold-time constraints
typically result in chip failures. 

To correctly analyze this timing constraint, the earliest actual arrival time of 
signal transitions at each node must be determined. The required arrival time of a 
sequential element in early mode is the time at which the earliest signal can arrive 
and still satisfy the library-cell hold-time requirement. 

For each gate $v$, $AAT_{EM}(v) \ge RAT_{EM}(v)$ must be satisfied, where $AAT_{EM}(v)$ is the
earliest actual arrival time of a signal transition, and $RAT_{EM}(v)$ is the required
arrival time in early mode, at gate $v$. The early-mode slack is then defined as

$$slack_{EM}(v) = AAT_{EM}(v) - RAT_{EM}(v)$$

When adapted to early-mode analysis, ZSA is called the near zero-slack
algorithm. The adapted algorithm seeks to decrease $TB(v)$ by decreasing $t(v)$ or
$t(e)$, so that all nodes have minimum early-mode timing slacks. However, since
$t(v)$ and $t(e)$ cannot be negative, node slacks may not necessarily all become zero.

Near Zero-Slack Algorithm (Early-Mode Analysis)

Input: timing graph G(V,E)
Output: timing budgets TB for each v ∈ V

1.  foreach (node v ∈ V)
2.    done[v] = false
3.  do
4.    (RAT_EM,AAT_EM,slack_EM) = STA_EM(G)     // early-mode STA
5.    slack_min = ∞
6.    foreach (node v_i ∈ V)
7.      TB[v_i] = DELAY(v_i) + DELAY(e_i)
8.    foreach (node v ∈ V)
9.      if ((done[v] == false) and (slack_EM[v] < slack_min) and (slack_EM[v] > 0))
10.       slack_min = slack_EM[v]
11.       v_min = v
12.   if (slack_min ≠ ∞)
13.     path = v_min
14.     ADD_TO_FRONT(path,BACKWARD_PATH_EM(v_min,G))
15.     ADD_TO_BACK(path,FORWARD_PATH_EM(v_min,G))
16.     for (i = 1 to |path|)
17.       node = path[i]
18.       path_E[i] = FIND_EDGE(V[node],E)     // corresponding edge of v_i
19.     for (i = 1 to |path|)
20.       node = path[i]
21.       s = MIN(slack_min / |path|,DELAY(path_E[i]))
22.       TB[node] = TB[node] - s              // decrease DELAY(node) or
                                               // DELAY(path_E[node])
23.       if (DELAY(path_E[i]) == 0)
24.         done[node] = true
25. while (slack_min < ∞)

In relation to the ZSA pseudocode, the procedure BACKWARD_PATH_EM($v_{min}$,G)
is equivalent to BACKWARD_PATH($v_{min}$,G), and FORWARD_PATH_EM($v_{min}$,G) is
equivalent to FORWARD_PATH($v_{min}$,G), except that early-mode analysis is
used for all arrival and required times.

Compared to the original ZSA, the two differences are (1) the use of early-mode
timing constraints and (2) the handling of the Boolean flag $done(v)$. Lines 1-2 set
$done(v)$ to false for every node $v$ in $V$. If $t(e_i)$ reaches zero for a node $v_i$, $done(v_i)$ is
set to true (lines 23-24). In subsequent iterations (lines 3-25), if $v_i$ is the minimum
slack node of $G$, it will be skipped (line 9) because $t(e_i)$ cannot be decreased
further. After the algorithm completes, each node $v$ will either have $slack(v) = 0$ or
$done(v)$ = true.

In practice, if the delay of a node does not satisfy its early-mode timing budget, the 
delay constraint can be satisfied by adding additional delay (padding) to appropriate 
components. However, there is always the danger that additional delay may cause 
violations of late-mode timing constraints. Thus, a circuit should be first designed 
with ZSA and late-mode analysis. Early-mode analysis may then be used to confirm 
that early-mode constraints are satisfied, or to guide circuit modifications to satisfy 
such constraints.

8.3 Timing-Driven Placement

Timing-driven placement (TDP) optimizes circuit delay, either to satisfy all timing 
constraints or to achieve the greatest possible clock frequency. It uses the results of 
STA (Sec. 8.2.1) to identify critical nets and attempts to improve signal propagation 
delay through those nets. Typically, TDP minimizes the magnitude of one or both of
the following: (1) worst negative slack (WNS)

$$WNS = \min_{\tau \in T} \, slack(\tau)$$

where $T$ is the set of timing endpoints, e.g., primary outputs and inputs to flip-flops,
and (2) total negative slack (TNS)

$$TNS = \sum_{\tau \in T,\; slack(\tau) < 0} slack(\tau)$$
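As a small illustration of the two metrics, the following Python sketch computes WNS and TNS from a table of endpoint slacks. The function names and the dictionary layout (endpoint name mapped to slack) are assumptions made for this example only.

```python
def worst_negative_slack(endpoint_slack):
    """Return the minimum slack over all timing endpoints."""
    return min(endpoint_slack.values())

def total_negative_slack(endpoint_slack):
    """Return the sum of slacks over endpoints whose slack is negative."""
    return sum(s for s in endpoint_slack.values() if s < 0)

# Hypothetical endpoint slacks: two failing endpoints and one passing endpoint.
slacks = {"ff1/D": -3, "ff2/D": 2, "out1": -1}
print(worst_negative_slack(slacks))   # -3
print(total_negative_slack(slacks))   # -4
```

Both quantities are at most zero whenever some endpoint fails timing, so improving timing means moving them toward zero.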

Algorithmic techniques for timing-driven placement can be categorized as net-based 
(Sec. 8.3.1), path-based or integrated (Sec. 8.3.2). There are two types of net-based 
techniques - (1) delay budgeting assigns upper bounds to the timing or length of 
individual nets, and (2) net weighting assigns higher priorities to critical nets during 
placement. Path-based placement seeks to shorten or speed up entire timing-critical 
paths rather than individual nets. While more accurate than net-based placement, 
path-based placement does not scale to large, modern designs because the number of
paths in some circuits, such as multipliers, can grow exponentially with the number
of gates. Both path-based and net-based approaches (1) rely on support within the 
placement algorithm, and (2) require a dedicated infrastructure for (incremental) 
calculation of timing statistics and parameters. Some placement approaches facilitate 
integration with timing-driven techniques. For instance, net weighting is naturally 
supported by simulated annealing and all analytic algorithms. Netlist partitioning 
algorithms support small integer net weights, but can usually be extended to support 
non-integer weights, either by scaling or by replacing bucket-based data structures 
with more general priority queues. 

Timing-driven placement algorithms often operate in multiple iterations, during 
which the delay budgets or net weights are adjusted based on the results of STA. 
Integrated algorithms typically use constraint-driven mathematical formulations in 
which STA results are incorporated as constraints and possibly in the objective 
function. Several TDP methods are discussed below, while more advanced 
algorithms can be found in [8.8], [8.17], [8.20], and Chap. 21 of [8.5]. 

In practice, some industrial flows do not incorporate timing-driven methods during 
initial placement because timing information can be very inaccurate until locations
are available. Instead, subsequent placement iterations, especially during detailed 
placement, perform timing optimizations. Integrated methods are commonly used; 
for example, the linear programming formulation (Sec. 8.3.2) is generally more 
accurate than net-weighting or delay budgeting, at the cost of increased runtime. A 
practical design flow for timing closure is introduced in Sec. 8.6.

8.3.1 Net-Based Techniques

Net-based approaches impose either quantitative priorities that reflect timing 
criticality (net weights), or upper bounds on the timing of nets, in the form of net 
constraints (delay budgets). Net weights are more effective at the early design stages, 
while delay budgets are more meaningful if timing analysis is more accurate. More 
information on net weighting can be found in [8.12].

Net weighting. Recall that a traditional placer optimizes total wirelength and 
routability. To account for timing, a placer can minimize the total weighted 
wirelength, where each net is assigned a net weight (Chap. 4). Typically, the higher 
the net weight is, the more timing-critical the net is considered. In practice, net 
weights are assigned either statically or dynamically to improve timing. 

Static net weights are computed before placement and do not change. They are 
usually based on slack - the more critical the net (the smaller the slack), the greater 
the weight. Static net weights can be either discrete, e.g.,

$$w = \begin{cases} \omega_1 & \text{if } slack > 0 \\ \omega_2 & \text{if } slack \le 0 \end{cases}$$

where $\omega_1 < \omega_2$ are constants greater than zero, or continuous, e.g.,

$$w = \left(1 - \frac{slack}{t}\right)^{\alpha}$$

where $t$ is the longest path delay and $\alpha$ is a criticality exponent.
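The two static weighting schemes can be written as short Python functions. This is only a sketch: the default values chosen for the two discrete weights and for the criticality exponent are illustrative assumptions, not values prescribed by the text.

```python
def discrete_weight(slack, w1=1.0, w2=2.0):
    """Return w1 for positive slack and the larger weight w2 otherwise."""
    return w1 if slack > 0 else w2

def continuous_weight(slack, longest_path_delay, alpha=2.0):
    """Return (1 - slack / t) ** alpha, where t is the longest path delay."""
    return (1.0 - slack / longest_path_delay) ** alpha

# A net with slack -5 on a design whose longest path delay is 10
# receives weight (1 + 0.5) ** 2 = 2.25 under the continuous scheme.
print(continuous_weight(-5, 10))   # 2.25
```

With the continuous form, a net with zero slack gets weight 1, negative slack pushes the weight above 1, and positive slack pulls it below 1.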

In addition to slack, various other parameters can be accounted for, such as net size 
and the number of critical paths traversing a given net. However, assigning too 
many higher weights may lead to increased total wirelength, routability difficulties, 
and the emergence of new critical paths. In other words, excessive net weighting 
may eventually lead to inferior timing. To this end, net weights can be assigned 
based on sensitivity, or how each net affects TNS. For example, the authors of [8.27] 
define the net weight of net as follows. Let

- $w_o(net)$ be the original net weight of net
- $slack(net)$ be the slack of net
- $slack_{target}$ be the target slack of the design
- $s_w^{SLACK}(net)$ be the slack sensitivity to the net weight of net
- $s_w^{TNS}(net)$ be the TNS sensitivity to the net weight of net
- $\alpha$ and $\beta$ be constant bounds on the net weight change that control the tradeoff
  between WNS and TNS

Then, if $slack(net) \le 0$,

$$w(net) = w_o(net) + \alpha \cdot (slack_{target} - slack(net)) \cdot s_w^{SLACK}(net) + \beta \cdot s_w^{TNS}(net)$$

Otherwise, if $slack(net) > 0$, then $w(net)$ remains the same, i.e., $w(net) = w_o(net)$.

Dynamic net weights are computed during placement iterations and keep an updated
timing profile. This can be more effective than static net weights, which are
computed before placement and can become outdated when net lengths change. An
example method updates slack values based on efficient calculation of incremental
slack for each net net [8.7]. For a given iteration $k$, let

- $slack_{k-1}(net)$ be the slack at iteration $k-1$
- $s_L^{DELAY}(net)$ be the delay sensitivity to the wirelength of net
- $\Delta L(net)$ be the change in wirelength between iteration $k-1$ and $k$ for net

Then, the estimated slack of net at iteration $k$ is

$$slack_k(net) = slack_{k-1}(net) - s_L^{DELAY}(net) \cdot \Delta L(net)$$

After the timing information has been updated, the net weights should be adjusted
accordingly. In general, this incremental method of weight modification is based on 
previous iterations. For instance, for each net net, the authors of [8.14] first compute
the net criticality $\upsilon$ at iteration $k$ as

$$\upsilon_k(net) = \begin{cases} \frac{1}{2}\,(\upsilon_{k-1}(net) + 1) & \text{if net is among the 3\% most critical nets} \\ \frac{1}{2}\,\upsilon_{k-1}(net) & \text{otherwise} \end{cases}$$

and then update the net weights as

$$w_k(net) = w_{k-1}(net) \cdot (1 + \upsilon_k(net))$$

Variants include using the previous $j$ iterations and using different relations between
the net weight and criticality.
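The incremental slack estimate and the criticality-based weight update can be sketched in Python as follows. The function names and dictionary layout are assumptions for illustration. In addition, the sketch always treats at least one net as critical, so that the 3% rule still does something on toy inputs with only a few nets; that rounding choice is not part of the cited methods.

```python
def estimate_slack(prev_slack, delay_sensitivity, delta_length):
    """Estimate a net's slack after its wirelength changed by delta_length."""
    return prev_slack - delay_sensitivity * delta_length

def update_criticality(prev_crit, slack, critical_fraction=0.03):
    """Halve each criticality, adding 1/2 for the most critical nets."""
    ranked = sorted(slack, key=slack.get)             # smallest slack first
    n_critical = max(1, int(len(ranked) * critical_fraction))  # assumption: >= 1
    critical = set(ranked[:n_critical])
    return {
        net: 0.5 * (prev_crit[net] + 1.0) if net in critical
        else 0.5 * prev_crit[net]
        for net in slack
    }

def update_weights(prev_weight, crit):
    """Scale each net weight by (1 + criticality)."""
    return {net: prev_weight[net] * (1.0 + crit[net]) for net in prev_weight}

# Hypothetical three-net example, starting from zero criticality and unit weights.
slack = {"n1": -2.0, "n2": 1.0, "n3": 3.0}
crit = update_criticality({"n1": 0.0, "n2": 0.0, "n3": 0.0}, slack)
weights = update_weights({"n1": 1.0, "n2": 1.0, "n3": 1.0}, crit)
print(weights["n1"])   # 1.5
```

A net that stays critical over many iterations sees its criticality approach 1, so its weight roughly doubles per iteration, while the criticality of a net that stops being critical decays by half each iteration.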

In practice, dynamic methods can be more effective than using static net weights, 
but require careful net weight assignment. Unlike static net weights, which are 
relevant to any placer, dynamic net weights are typically tailored to each type of 
placer; their computation is integrated with the placement algorithm. To be scalable, 
the re-computation of timing information and net weights must be efficient [8.7]. 

Delay budgeting. An alternative to using net weights is to limit the delay, or the 
total length, of each net by using net constraints. This mitigates several drawbacks 
of net weighting. First, predicting the exact effect of a net weight on timing or total 
wirelength is difficult. For example, increasing weights of multiple nets may lead to 
the same (or very similar) placement. Second, there is no guarantee that a net's 
timing or length will decrease because of a higher net weight. Instead, net-constraint 
methods have better control and explicitly limit the length or slack of nets. However, 
to ensure scalability, net constraints must be generated such that they do not 
over-constrain the solution space or limit the total number of solutions, thereby 
hurting solution quality. In practice, these net constraints can be generated statically, 
before placement, or dynamically, when the net constraints are added or modified 
during each iteration of placement. A common method to calculate delay budgets is 
the zero-slack algorithm (ZSA), previously discussed in Sec. 8.2.2. Other advanced 
methods for delay budgeting can be found in [8.15]. 

The support for constraints in each type of placer must be implemented carefully so 
as to not sacrifice runtime or solution quality. For instance, min-cut placers must 
choose how to assign cells to partitions while meeting wirelength constraints. To 
meet these constraints, some cells may have to be assigned to certain partitions. 
Force-directed placers can adjust the attraction force on certain nets that exceed a 
certain length, but must ensure that these forces are in balance with those forces on 
other nets. More advanced algorithms for min-cut and force-directed placers on TDP 
can be found in [8.16] and [8.26], respectively.

8.3.2 Embedding STA into Linear Programs for Placement

Unlike net-based methods, where the timing requirements are mapped to net weights 
or net constraints, path-based methods for timing-driven placement directly optimize 
the design's timing. However, as the number of (critical) paths of concern can grow 
quickly, this method is much slower than net-based approaches. To improve 
scalability, timing analysis may be captured by a set of constraints and an 
optimization objective within a mathematical programming framework, such as 
linear programming. In the context of timing-driven placement, a linear program 
(LP) minimizes a function of slack, such as TNS, subject to two major types of 
constraints: (1) physical, which define the locations of the cells, and (2) timing, 
which define the slack requirements. Other constraints such as electrical constraints 
may also be incorporated. 

Physical constraints. The physical constraints can be defined as follows. Given the
set of cells $V$ and the set of nets $E$, let

- $x_v$ and $y_v$ be the center of cell $v \in V$
- $V_e$ be the set of cells connected to net $e \in E$
- $left(e)$, $right(e)$, $bottom(e)$, and $top(e)$ respectively be the coordinates of the left,
  right, bottom, and top boundaries of $e$'s bounding box
- $\delta_x(v,e)$ and $\delta_y(v,e)$ be pin offsets from $x_v$ and $y_v$ for $v$'s pin connected to $e$

Then, for all $v \in V_e$,

$$left(e) \le x_v + \delta_x(v,e)$$
$$right(e) \ge x_v + \delta_x(v,e)$$
$$bottom(e) \le y_v + \delta_y(v,e)$$
$$top(e) \ge y_v + \delta_y(v,e)$$

That is, every pin of a given net $e$ must be contained within $e$'s bounding box. Then,
$e$'s half-perimeter wirelength (HPWL) (Sec. 4.2) is defined as

$$L(e) = right(e) - left(e) + top(e) - bottom(e)$$
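In the LP, the four bounding-box coordinates are variables that the inequalities above only bound from one side. For fixed cell locations, the tightest values satisfying them are the minimum and maximum pin coordinates. The Python sketch below evaluates that tightest box and the resulting HPWL; the data layout (a net as a list of `(cell, dx, dy)` pin entries) is an assumption made for this example.

```python
def bounding_box(net_pins, x, y):
    """Return (left, right, bottom, top) of the tightest box around a net's pins.

    net_pins: list of (cell, dx, dy), where dx, dy are pin offsets from the
    cell center; x and y map each cell to its center coordinates.
    """
    xs = [x[cell] + dx for cell, dx, dy in net_pins]
    ys = [y[cell] + dy for cell, dx, dy in net_pins]
    return min(xs), max(xs), min(ys), max(ys)

def hpwl(net_pins, x, y):
    """Half-perimeter wirelength L(e) = right - left + top - bottom."""
    left, right, bottom, top = bounding_box(net_pins, x, y)
    return (right - left) + (top - bottom)

# Hypothetical three-pin net.
x = {"a": 0, "b": 4, "c": 2}
y = {"a": 0, "b": 1, "c": 5}
pins = [("a", 1, 0), ("b", 0, 0), ("c", 0, -1)]
print(hpwl(pins, x, y))   # 7
```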
Timing constraints. The timing constraints can be defined as follows. Let

- $t_{GATE}(v_i,v_o)$ be the gate delay from an input pin $v_i$ to the output pin $v_o$ for cell $v$
- $t_{NET}(e,u_o,v_i)$ be net $e$'s delay from cell $u$'s output pin $u_o$ to cell $v$'s input pin $v_i$
- $AAT(v_j)$ be the arrival time on pin $j$ of cell $v$

Then, define two types of timing constraints: those that account for input pins, and
those that account for output pins.

For every input pin $v_i$ of cell $v$, the arrival time at each $v_i$ is the arrival time at the
previous output pin $u_o$ of cell $u$ plus the net delay of the connecting net $e$.

$$AAT(v_i) = AAT(u_o) + t_{NET}(e,u_o,v_i)$$

For every output pin $v_o$ of cell $v$, the arrival time at $v_o$ should be greater than or equal
to the arrival time plus gate delay of each input $v_i$. That is, for each input $v_i$ of cell $v$,

$$AAT(v_o) \ge AAT(v_i) + t_{GATE}(v_i,v_o)$$

For every pin $\tau_p$ in a sequential cell $\tau$, the slack is computed as the difference
between the required arrival time $RAT(\tau_p)$ and actual arrival time $AAT(\tau_p)$.

$$slack(\tau_p) \le RAT(\tau_p) - AAT(\tau_p)$$

The required time $RAT(\tau_p)$ is specified at every input pin of a flip-flop and all
primary outputs, and the arrival time $AAT(\tau_p)$ is specified at each output pin of a
flip-flop and all primary inputs. To ensure that the program does not over-optimize,
i.e., does not optimize beyond what is required to (safely) meet timing, upper bound
all pin slacks by zero (or a small positive value).

$$slack(\tau_p) \le 0$$

Objective functions. Using the above constraints and definitions, the LP can
optimize (1) total negative slack (TNS)

$$\max: \sum_{\tau_p \in Pins(\tau),\; \tau \in T} slack(\tau_p)$$

where $Pins(\tau)$ is the set of pins of cell $\tau$, and $T$ is again the set of all sequential
elements or endpoints, or (2) worst-negative slack (WNS)

$$\max: WNS$$

where $WNS \le slack(\tau_p)$ for all pins, or (3) a combination of wirelength and slack

$$\min: \sum_{e \in E} L(e) - \alpha \cdot WNS$$

where $E$ is the set of all nets, $\alpha$ is a constant between 0 and 1 that trades off WNS
and wirelength, and $L(e)$ is the HPWL of net $e$.

8.4 Timing-Driven Routing

In modern ICs, interconnect can contribute substantially to total signal delay. Thus, 
interconnect delay is a concern during the routing stages. Timing-driven routing 
seeks to minimize one or both of (1) maximum sink delay, which is the maximum 
interconnect delay from the source node to any sink of a given net, and (2) total 
wirelength, which affects the load-dependent delay of the net's driving gate. 

For a given signal net net, let $s_0$ be the source node and $sinks = \{s_1, \ldots, s_n\}$ be the
sinks. Let $G = (V,E)$ be a corresponding weighted graph where $V = \{v_0, v_1, \ldots, v_n\}$
represents the source and sink nodes of net, and the weight of an edge $e(v_i,v_j) \in E$
represents the routing cost between the terminals $v_i$ and $v_j$. For any spanning tree $T$
over $G$, let $radius(T)$ be the length of the longest source-sink path in $T$, and let
$cost(T)$ be the total edge weight of $T$.
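The two tree metrics are simple to evaluate. The following Python sketch, which assumes a tree given as a list of `(u, v, weight)` edges (a representation chosen for this example), computes the total edge weight and the longest source-to-node path length.

```python
def tree_cost(edges):
    """Total edge weight of a tree given as (u, v, weight) triples."""
    return sum(w for _, _, w in edges)

def tree_radius(edges, source):
    """Length of the longest path from source to any node of the tree."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0}
    stack = [source]
    while stack:                      # tree traversal; paths are unique in a tree
        u = stack.pop()
        for v, w in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return max(dist.values())

chain = [("s0", "s1", 3), ("s1", "s2", 4)]    # light but deep
star = [("s0", "s1", 3), ("s0", "s2", 5)]     # shallow but heavier
print(tree_cost(chain), tree_radius(chain, "s0"))   # 7 7
print(tree_cost(star), tree_radius(star, "s0"))     # 8 5
```

The chain has the smaller cost and the star has the smaller radius, which is the tension that the routing tree constructions in this section try to balance.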

Because source-sink wirelength reflects source-sink signal delay, i.e., the linear and 
Elmore delay [8.13] models are well-correlated, a routing tree ideally minimizes 
both radius and cost. However, for most signal nets, radius and cost cannot be 
minimized at the same time.

[Fig. 8.9 panels: (a) radius(T) = 8, cost(T) = 20; (b) radius(T) = 13, cost(T) = 13;
(c) radius(T) = 11, cost(T) = 16]

Fig. 8.9 Delay vs. cost, i.e., "shallow" vs. "light", tradeoff in tree construction. (a) A shortest-paths
tree. (b) A minimum-cost tree. (c) A compromise with respect to radius (depth) and cost.



Fig. 8.9 illustrates this radius vs. cost ("shallow" vs. "light") tradeoff, where the
labels represent edge costs. The tree in Fig. 8.9(a) has minimum radius, and the 
shortest possible path length from the source to every sink. It is therefore a 
shortest-paths tree, and can be constructed using Dijkstra's algorithm (Sec. 5.6.3). 
The tree in Fig. 8.9(b) has minimum cost and is a minimum spanning tree (MST), 
and can be constructed using Prim's algorithm (Sec. 5.6.1). Due to their respective 
large cost and large radius, neither of these trees may be desirable in practice. The 
tree in Fig. 8.9(c) is a compromise that has both shallow and light properties.

8.4.1 The Bounded-Radius, Bounded-Cost Algorithm

The bounded-radius, bounded-cost (BRBC) algorithm [8.11] finds a shallow-light
spanning tree with provable bounds on both radius and cost. Each of these
parameters is within a constant factor of its optimal value. From graph $G(V,E)$, the
algorithm first constructs a subgraph $G'$ that contains all $v \in V$, and has small cost
and small radius. Then, the shortest-paths tree $T_{BRBC}$ in $G'$ will also have small
radius and cost because it is a subgraph of $G'$. $T_{BRBC}$ is determined by a parameter
$\varepsilon \ge 0$, which trades off between radius and cost. When $\varepsilon = 0$, $T_{BRBC}$ has minimum
radius, and when $\varepsilon = \infty$, $T_{BRBC}$ has minimum cost. More precisely, $T_{BRBC}$ satisfies
both

$$radius(T_{BRBC}) \le (1 + \varepsilon) \cdot radius(T_S)$$

where $T_S$ is a shortest-paths tree of $G$, and

$$cost(T_{BRBC}) \le \left(1 + \frac{2}{\varepsilon}\right) \cdot cost(T_M)$$

where $T_M$ is a minimum spanning tree of $G$.



BRBC Algorithm

Input: graph G(V,E), parameter ε ≥ 0
Output: spanning tree T_BRBC

1.  T_S = SHORTEST_PATHS_TREE(G)
2.  T_M = MINIMUM_COST_TREE(G)
3.  G' = T_M
4.  U = DEPTH_FIRST_TOUR(T_M)
5.  sum = 0
6.  for (i = 1 to |U| - 1)
7.    u_prev = U[i]
8.    u_curr = U[i+1]
9.    sum = sum + cost_TM[u_prev][u_curr]      // sum of u_prev ~ u_curr costs in T_M
10.   if (sum > ε · cost_TS[v_0][u_curr])      // shortest-path cost in T_S
                                               // from source v_0 to u_curr
11.     G' = ADD(G',PATH(T_S,v_0,u_curr))      // add shortest-path edges to G'
12.     sum = 0                                // and reset sum
13. T_BRBC = SHORTEST_PATHS_TREE(G')



To execute the BRBC algorithm, compute a shortest-paths tree $T_S$ of $G$, and a
minimum spanning tree $T_M$ of $G$ (lines 1-2). Initialize the graph $G'$ to $T_M$ (line 3). Let
$U$ be the sequence of edges corresponding to any depth-first tour of $T_M$ (line 4). This
tour traverses each edge of $T_M$ exactly twice (Fig. 8.10), so $cost(U) = 2 \cdot cost(T_M)$.
Traverse $U$ while keeping a running total sum of traversed edge costs (lines 6-9). As
the traversal visits each node $u_{curr}$, check whether sum is strictly greater than $\varepsilon$ times
the cost (distance) between $v_0$ and $u_{curr}$ in $T_S$ (line 10). If so, merge the edges of the
$v_0 \sim u_{curr}$ path in $T_S$ with $G'$ and reset sum to 0 (lines 11-12). Continue traversing $U$
while repeating this process (lines 6-12). Return $T_{BRBC}$, a shortest-paths tree over $G'$
(line 13).



Fig. 8.10 Depth-first tour of the minimum spanning tree in Fig. 8.9 with traversal sequence:
$s_0 \to s_1 \to s_2 \to s_3 \to s_4 \to s_3 \to s_2 \to s_1 \to s_0$.



8.4.2 Prim-Dijkstra Tradeoff



Another method to generate a routing tree trades off radius and cost by using an 
explicit, quantitative metric. Typically, the minimum cost and minimum radius 
objectives are optimized by Prim's minimum spanning tree algorithm (Sec. 5.6.1) 
and Dijkstra's shortest-paths tree algorithm (Sec. 5.6.3), respectively. Although
these two algorithms target two different objectives, they construct spanning trees
over the set of terminals in very similar ways. From the set of sinks $S$, each
algorithm begins with tree $T$ consisting only of $s_0$, and iteratively adds a sink $s_j$ and
the edge connecting a node $s_i$ in $T$ to $s_j$. The algorithms differ only in the cost
function by which the next sink and edge are chosen.

In Prim's algorithm, sink $s_j$ and edge $e(s_i,s_j)$ are selected to minimize the edge cost
between $s_i$ and $s_j$

$$cost(s_i,s_j)$$

where $s_i \in T$ and $s_j \in S - T$. In Dijkstra's algorithm, sink $s_j$ and edge $e(s_i,s_j)$ are
selected to minimize the path cost between source $s_0$ and sink $s_j$

$$cost(s_i) + cost(s_i,s_j)$$

where $s_i \in T$, $s_j \in S - T$, and $cost(s_i)$ is the total cost of the shortest path from $s_0$ to $s_i$.

To combine the two objectives, the authors of [8.1] proposed the PD tradeoff, an
explicit compromise between Prim's and Dijkstra's algorithms. This algorithm
iteratively adds the sink $s_j$ and the edge $e(s_i,s_j)$ to $T$ such that

$$\gamma \cdot cost(s_i) + cost(s_i,s_j)$$

is minimum over all $s_i \in T$ and $s_j \in S - T$ for a prescribed constant $0 \le \gamma \le 1$.






When $\gamma = 0$, the PD tradeoff is identical to Prim's algorithm, and $T$ is a minimum
spanning tree $T_M$. As $\gamma$ increases, the PD tradeoff constructs spanning trees with
progressively higher cost but lower radius. When $\gamma = 1$, the PD tradeoff is identical
to Dijkstra's algorithm, and $T$ is a shortest-paths tree $T_S$.
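A direct, unoptimized Python sketch of the PD tradeoff is given below. It assumes a connected, undirected adjacency dictionary `adj[u][v] = weight` and takes the path cost of a tree node to be its source-to-node path length within the growing tree; the function name and data layout are hypothetical. Each step scans all tree-to-non-tree edges, so the running time is far from the best achievable.

```python
def prim_dijkstra(adj, src, gamma):
    """Grow a spanning tree from src, each time adding the edge (i, j) that
    minimizes gamma * path_cost[i] + weight(i, j)."""
    path_cost = {src: 0}      # source-to-node path length within the tree
    parent = {src: None}
    while len(parent) < len(adj):
        best = None
        for i in parent:
            for j, w in adj[i].items():
                if j in parent:
                    continue
                key = gamma * path_cost[i] + w
                if best is None or key < best[0]:
                    best = (key, i, j, w)
        _, i, j, w = best
        parent[j] = i
        path_cost[j] = path_cost[i] + w
    return parent, path_cost

# Triangle with edge weights 0-1: 2, 1-2: 2, 0-2: 3.
adj = {0: {1: 2, 2: 3}, 1: {0: 2, 2: 2}, 2: {0: 3, 1: 2}}
print(prim_dijkstra(adj, 0, 0)[1][2])   # 4  (Prim: node 2 reached through node 1)
print(prim_dijkstra(adj, 0, 1)[1][2])   # 3  (Dijkstra: node 2 reached directly)
```

In this small example the setting 0 gives cost 4 and radius 4, while the setting 1 gives cost 5 and radius 3, which matches the behavior described for the two extremes.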

Fig. 8.11 shows the behavior of the PD tradeoff with different values of $\gamma$. The tree
in Fig. 8.11(a) is the result of a smaller value of $\gamma$, and has smaller cost but larger
radius than the tree in Fig. 8.11(b).



[Fig. 8.11 panels: (a) radius(T) = 19, cost(T) = 35; (b) radius(T) = 15, cost(T) = 39]

Fig. 8.11 Result of the Prim-Dijkstra (PD) tradeoff. Let the cost between any two terminals be their
Manhattan distance. (a) A tree with $\gamma$ = 1/4. (b) A tree with $\gamma$ = 3/4, which has higher cost but
smaller radius.



8.4.3 Minimization of Source-to-Sink Delay

The previous subsections described algorithms that seek radius-cost tradeoffs. Since 
the spanning tree radius reflects actual wire delay, these algorithms indirectly 
minimize sink delays. However, the wirelength-delay abstraction, as well as the 
parameters $\varepsilon$ in BRBC and $\gamma$ in the PD tradeoff, prevent direct control of delay.
Instead, given a set of sinks $S$, the Elmore routing tree (ERT) algorithm [8.6]
iteratively adds sink $s_j$ and edge $e(s_i,s_j)$ to the growing tree $T$ such that the Elmore
delay from the source $s_0$ to the sink $s_j$, where $s_i \in T$ and $s_j \in S - T$, is minimized.

Since the ERT algorithm does not treat any particular sink differently from the 
others, it is classified as a net-dependent approach. However, during the actual 
design and timing optimization, different timing constraints and slacks are imposed 
for each sink of a multi-pin net. 



The sink with the least timing slack is the critical sink of the net. A routing tree 
construction that is oblivious to critical-sink information may create avoidable 
negative slack, and degrade the overall timing performance of the design. Thus, 
several routing tree constructions, i.e., path-dependent approaches, have been 
developed that address the critical-sink routing problem. 

Critical-sink routing tree (CSRT) problem. Given a signal net net with source s_0,
sinks S = {s_1, ... , s_n}, and sink criticalities α(i) ≥ 0 for each s_i ∈ S, construct a
routing tree T such that

$$\sum_{i=1}^{n} \alpha(i) \cdot t(s_0, s_i)$$

is minimized, where t(s_0, s_i) is the signal delay from source s_0 to sink s_i. The sink
criticality α(i) reflects the timing criticality of the corresponding sink s_i. If a sink is
on a critical path, then its timing criticality will be greater than that of other sinks.
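
As a small illustration of the CSRT objective, the sketch below evaluates the criticality-weighted sum of source-to-sink delays for two candidate trees. The sink names, criticalities and delays are made-up values, and `weighted_delay` is a hypothetical helper rather than part of any tool.

```python
def weighted_delay(criticality, delay):
    """CSRT objective: sum over sinks of criticality(i) * t(s_0, s_i)."""
    return sum(criticality[s] * delay[s] for s in criticality)


# Hypothetical net with three sinks; s2 is the critical sink.
criticality = {"s1": 1, "s2": 8, "s3": 1}

# Source-to-sink delays in two hypothetical candidate trees.
tree_a = {"s1": 3, "s2": 9, "s3": 4}   # shorter paths to the non-critical sinks
tree_b = {"s1": 5, "s2": 5, "s3": 6}   # shorter path to the critical sink

critical_sink = max(criticality, key=criticality.get)

if __name__ == "__main__":
    print(critical_sink, weighted_delay(criticality, tree_a),
          weighted_delay(criticality, tree_b))
```

Although tree_a has the smaller unweighted delay sum, tree_b scores better once the criticality of s2 is taken into account.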

A critical-sink Steiner tree heuristic [8.18] for the CSRT problem [8.6] first
constructs a heuristic minimum-cost Steiner tree T_0 over all terminals of S except the
critical sink s_c, the sink with the highest criticality. Then, to reduce t(s_0, s_c), the
heuristic adds s_c into T_0 by heuristic variants, such as the following approaches.

— H_0: introduce a single wire from s_c to s_0.

— H_1: introduce the shortest possible wire that can join s_c to T_0, so long as the path
from s_0 to s_c is monotone, i.e., of shortest possible total length.

— H_Best: try all shortest connections from s_c to edges in T_0, as well as from s_c to s_0.
Perform timing analysis on each of these trees and return the one with the
lowest delay at s_c.

The time complexity of the critical-sink Steiner heuristic is dominated by the
construction of T_0, or by the timing analysis in the H_Best variant. Though H_Best
achieves the best routing solution in terms of timing slack, the other two variants
may also provide acceptable combinations of runtime efficiency and solution quality.

For high-performance designs, even more comprehensively timing-driven routing
tree constructions are needed. Available slack along each source-sink timing arc is
best reflected by the required arrival time (RAT) at each sink. In the following RAT
tree problem formulation, each sink of the signal net has a required arrival time
which should not be exceeded by the source-sink delay in the routing tree.

RAT tree problem. For a signal net with source s_0 and sink set S, find a
minimum-cost routing tree T such that

$$\min_{s \in S} \big( RAT(s) - t(s_0, s) \big) \geq 0$$

Here, RAT(s) is the required arrival time for sink s, and t(s_0, s) is the signal delay in T
from source s_0 to sink s. Effective algorithms to solve the RAT tree problem can be
found in [8.19]. More information on timing-driven routing can be found in [8.3].

8.5 Physical Synthesis



Recall from Sec. 8.2 that the correct operation of a chip with respect to setup
constraints requires that AAT ≤ RAT at all nodes. If any nodes violate this condition,
i.e., exhibit negative slack, then physical synthesis, a collection of timing
optimizations, is applied until all slacks are non-negative. There are two aspects to
the optimization: timing budgeting and timing correction. During timing budgeting,
target delays are allocated to arcs along timing paths to promote timing closure
during the placement and routing stages (Secs. 8.2.2 and 8.3.1), as well as during
timing correction (Secs. 8.5.1-8.5.3). During timing correction, the netlist is
modified to meet timing constraints using such operations as changing the size of
gates, inserting buffers, and netlist restructuring. In practice, a critical path of
minimum-slack nodes between two sequential elements is identified, and timing
optimizations are applied to improve slack without changing the logical function.

► 8.5.1 Gate Sizing

In the standard-cell methodology, each logic gate, e.g., NAND or NOR, is typically 
available in multiple sizes that correspond to different drive strengths. The drive 
strength is the amount of current that the gate can provide during switching. 

Fig. 8.12 Gate delay vs. load capacitance for three gate sizes A, B, and C, in increasing order.



Fig. 8.12 graphs load capacitance versus delay for versions A, B and C of a gate v,
with different sizes (drive strengths), where

$$size(v_A) < size(v_B) < size(v_C)$$

A gate with larger size has lower output resistance and can drive a larger load
capacitance with smaller load-dependent delay. However, a gate with larger size
also has a larger intrinsic delay due to the parasitic output capacitance of the gate
itself. Thus, when the load capacitance is large,

$$t(v_C) < t(v_B) < t(v_A)$$

because the load-dependent delay dominates. When the load capacitance is small,

$$t(v_A) < t(v_B) < t(v_C)$$

because the intrinsic delay dominates. Increasing size(v) also increases the gate
capacitance of v, which, in turn, increases the load capacitance seen by fanin drivers.
Although this relationship is not shown, the effects of gate capacitance on the delays
of fanin gates will be considered below.
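
The crossover between intrinsic and load-dependent delay can be illustrated with a simple linear model, delay = intrinsic delay + output resistance × load capacitance. This model and its coefficients are assumptions made for illustration; they are chosen so that sizes A and C give 40 ps and 28 ps at a 3 fF load, and size B gives 45 ps and 33 ps at 5 fF and 3 fF, which are the values quoted in this section. The actual curves of Fig. 8.12 need not be linear.

```python
# (intrinsic delay in ps, slope in ps per fF) for each gate size; assumed values.
GATE_MODELS = {
    "A": (10.0, 10.0),   # smallest gate: low intrinsic delay, weak drive
    "B": (15.0, 6.0),
    "C": (22.0, 2.0),    # largest gate: high intrinsic delay, strong drive
}


def gate_delay(size, c_load):
    """Delay in ps of the given gate size when driving c_load (in fF)."""
    intrinsic, slope = GATE_MODELS[size]
    return intrinsic + slope * c_load


def best_size(c_load):
    """Gate size with the smallest delay at this load."""
    return min(GATE_MODELS, key=lambda s: gate_delay(s, c_load))


if __name__ == "__main__":
    for c_load in (0.5, 1.5, 3.0, 5.0):
        delays = {s: gate_delay(s, c_load) for s in GATE_MODELS}
        print(c_load, delays, best_size(c_load))
```

Under these coefficients the smallest gate is fastest for light loads and the largest gate for heavy loads. The sketch ignores the extra input capacitance that a larger gate presents to its fanin drivers.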

Resizing transformations adjust the size of v to achieve a lower delay (Fig. 8.13). Let
C(p) denote the load capacitance of pin p. In Fig. 8.13 (top), the total load
capacitance driven by gate v is C(d) + C(e) + C(f) = 3 fF. Using gate size A (Fig.
8.13, lower left), the gate delay will be t(v_A) = 40 ps, assuming the load-delay
relations in Fig. 8.12. However, using gate size C (Fig. 8.13, lower right), the gate
delay is t(v_C) = 28 ps. Thus, for a load capacitance value of 3 fF, gate delay is
improved by 12 ps if v_C is used instead of v_A. Recall that v_C has larger input
capacitance at pins a and b, which increases delays of fanin gates. Details of resizing
strategies can be found in [8.34]. More information on gate sizing can be found in
[8.33].





[Figure: gate v with input pins a and b drives pins d, e and f, where C(d) = 1.5, C(e) = 1.0 and C(f) = 0.5.]

Fig. 8.13 Resizing gate v from gate size A to size C (Fig. 8.12) can achieve a lower gate delay.



► 8.5.2 Buffering 



A buffer is a gate, typically two serially-connected inverters, that regenerates a
signal without changing functionality. Buffers can (1) improve timing delays
either by speeding up the circuit or by serving as delay elements, and (2) modify
transition times to improve signal integrity and coupling-induced delay variation.

In Fig. 8.14 (left), the (actual) arrival time at fanout pins d-h for gate v_B is t(v_B) =
45 ps. Let pins d and e be on the critical path with required arrival times below 35
ps, and let the input pin capacitance of buffer y be 1 fF. Then, adding y reduces the
load capacitance of v_B from 5 to 3 fF, and reduces the arrival times at d and e to t(v_B)
= 33 ps. That is, the delay of gate v_B is improved by using y to shield v_B from
some portion of its initial load capacitance. In Fig. 8.14 (right), after y is inserted,
the arrival time at pins f, g and h becomes t(v_B) + t(y) = 33 + 33 = 66 ps.
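
The arithmetic of this example can be retraced in a few lines. The sketch assumes that each of the five fanout pins contributes 1 fF (consistent with the quoted total of 5 fF) and uses a two-entry lookup table holding the delays quoted in the text for loads of 3 fF and 5 fF. It also assumes that the buffer's delay at a 3 fF load is the quoted 33 ps. None of these names belong to a real tool.

```python
PIN_CAP = {"d": 1, "e": 1, "f": 1, "g": 1, "h": 1}   # fF per fanout pin (assumed)
BUFFER_IN_CAP = 1                                    # fF, input pin of the buffer
DELAY_AT_LOAD = {3: 33, 5: 45}                       # ps, delays quoted in the text
RAT_CRITICAL = 35                                    # ps, bound for pins d and e


def load(pins, extra=0):
    """Total capacitance of the listed pins plus any extra load, in fF."""
    return sum(PIN_CAP[p] for p in pins) + extra


# Before buffering: the gate drives all five pins directly.
arrival_before = {p: DELAY_AT_LOAD[load("defgh")] for p in "defgh"}

# After buffering: the gate drives d, e and the buffer input; the buffer drives f, g, h.
t_gate = DELAY_AT_LOAD[load("de", BUFFER_IN_CAP)]
t_buffer = DELAY_AT_LOAD[load("fgh")]
arrival_after = {p: t_gate for p in "de"}
arrival_after.update({p: t_gate + t_buffer for p in "fgh"})

if __name__ == "__main__":
    print(arrival_before)
    print(arrival_after)
    print({p: arrival_after[p] < RAT_CRITICAL for p in "de"})
```

The critical pins d and e now meet their bound, while the non-critical pins f, g and h become slower, which is acceptable only if their slack allows it.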

Fig. 8.14 Improving t(v_B) by inserting buffer y to partially shield v_B's load capacitance.

A major drawback of buffering techniques is that they consume the available area
and increase power consumption. Despite the judicious use of buffering by
modern tools, the number of buffers has been steadily increasing in large designs
due to technology scaling trends, where interconnect is becoming relatively slower
compared to gates. In modern high-performance designs, buffers can comprise
10-20% of all standard cell instances, and up to 44% in some designs [8.31].

► 8.5.3 Netlist Restructuring

Often, the netlist itself can be modified to improve timing. Such changes should
not alter the functionality of the circuit, but can use additional gates or modify
(rewire) the connections between existing gates to improve driving strength and
signal integrity. This section discusses common netlist modifications. More
advanced methods for restructuring can be found in [8.25].

Cloning (Replication). Duplicating gates can reduce delay in two situations: (1)
when a gate with significant fanout may be slow due to its fanout capacitance, and
(2) when a gate's output fans out in two different directions, making it impossible
to find a good placement for this gate. The effect of cloning (replication) is to split
the driven capacitance between two equivalent gates, at the cost of increasing the
fanout of upstream gates.



In Fig. 8.15 (left), using the same load-delay relations of Fig. 8.12, the gate delay
t(v_B) of gate v_B is 45 ps. However, in Fig. 8.15 (right), after cloning, t(v_A) = 30 ps
and t(v_B) = 33 ps. Cloning also increases the input pin capacitance seen by the
fanin gates that generate signals a and b. In general, cloning allows more freedom
for local placement, e.g., the instance v_A can be placed close to sinks d and e,
while the instance v_B can be placed close to sinks f, g and h, with the tradeoff of
increased congestion and routing cost.

Fig. 8.15 Cloning or duplicating gates to reduce maximum local fanout.



When the downstream capacitance is large, buffering may be a better alternative 
than cloning because buffers do not increase the fanout capacitance of upstream 
gates. However, buffering cannot replace placement-driven cloning. An exercise 
at the end of this chapter expands further upon this concept. 

The second application of cloning allows the designers to replicate gates and place
each clone closer to its downstream logic. In Fig. 8.16, v drives five signals d-h,
where signals d, e and f are close, and g and h are located much farther away. To
mitigate the large fanout of v and the large interconnect delay caused by remote
signals, gate v is cloned. The original gate v remains with only signals d, e, and f,
and a new copy of v (v') is placed closer to g and h.



Fig. 8.16 Cloning transformation: a driving gate is duplicated to reduce remoteness of its fanouts.

Redesign of fanin tree. The logic design phase often provides a circuit with the
minimum number of logic levels. Minimizing the maximum number of gates on a
path between sequential elements tends to produce a balanced circuit with similar
path delays from inputs to outputs. However, input signals may arrive at varied
times, so the minimum-level circuit may not be timing-optimal. In Fig. 8.17, the
arrival time AAT(f) of pin f is 6 no matter how the input signals are mapped to
gate input pins. However, the unbalanced network has a shorter input-output path
which can be used by a later-arriving signal, where AAT(f) = 5.
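
The effect of tree shape on the output arrival time can be reproduced with a short sketch. It assumes that every gate has the same delay and that a gate's output arrival time is the latest input arrival time plus that delay. The input arrival times below are hypothetical and are not the values of Fig. 8.17.

```python
def arrival(node, aat, gate_delay=1):
    """Arrival time at the output of a fanin tree.

    node is either an input name (str) or a tuple of child nodes feeding one gate.
    """
    if isinstance(node, str):
        return aat[node]
    return max(arrival(child, aat, gate_delay) for child in node) + gate_delay


# Hypothetical input arrival times; d arrives last.
aat = {"a": 0, "b": 1, "c": 2, "d": 3}

balanced = (("a", "b"), ("c", "d"))        # two logic levels for every input
unbalanced = ((("a", "b"), "c"), "d")      # d passes through only one gate

if __name__ == "__main__":
    print(arrival(balanced, aat), arrival(unbalanced, aat))
```

Here the unbalanced tree has more logic levels overall, yet it finishes earlier because the latest input traverses the shortest path.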

Fig. 8.17 Redesigning a fanin tree to have smaller input-to-output delay. The arrival times are
denoted in angular brackets, and the delays are denoted in parentheses.

Redesign of fanout tree. In the same spirit as Fig. 8.17, it is possible to improve
timing by rebalancing the output load capacitance in a fanout tree so as to reduce
the delay of the longest path. In Fig. 8.18, buffer y_1 is needed because the load
capacitance of critical path path_1 is large. However, by redesigning the fanout tree
to reduce the load capacitance of path_1, use of the buffer y_1 can be avoided.
Increased delay on path_2 may be acceptable if that path is not critical even after
the load capacitance of buffer y_2 is increased.

Fig. 8.18 Redesign of a fanout tree to reduce the load capacitance of path_1.

Swapping commutative pins. Although the input pins of, e.g., a two-input
NAND gate are logically equivalent, in the actual transistor network they will
have different delays to the output pin. When the pin node convention is used for
STA (Sec. 8.2.1), the internal input-output arcs will have different delays. Hence,
path delays can change when the input pin assignment is changed. The rule of
thumb for pin assignment is to assign a later- (sooner-) arriving signal to an
equivalent input pin with shorter (longer) input-output delay.

In Fig. 8.19, the internal timing arcs are labeled with corresponding delays in
parentheses, and pins a, b, c and f are labeled with corresponding arrival times in
angular brackets. In the circuit on the left, the arrival time at f can be improved
from 5 to 3 by swapping pins a and c.
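
The rule of thumb for commutative pins can be written as a short sketch: sort signals by arrival time, sort equivalent pins by input-output delay, and pair the latest signal with the fastest pin. The arrival times of a and c (0 and 2) and the output times (5 before, 3 after) follow this example. The arrival time of b, the pin names, and the pin delays are assumptions chosen to be consistent with those numbers.

```python
def output_arrival(assignment, aat, pin_delay):
    """Output arrival time: latest (signal arrival + delay of its assigned pin)."""
    return max(aat[sig] + pin_delay[pin] for sig, pin in assignment.items())


def assign_pins(aat, pin_delay):
    """Pair later-arriving signals with pins that have shorter input-output delay."""
    signals = sorted(aat, key=aat.get, reverse=True)   # latest signal first
    pins = sorted(pin_delay, key=pin_delay.get)        # fastest pin first
    return dict(zip(signals, pins))


aat = {"a": 0, "b": 1, "c": 2}                 # b is assumed
pin_delay = {"p1": 1, "p2": 2, "p3": 3}        # hypothetical internal arc delays

original = {"a": "p1", "b": "p2", "c": "p3"}   # latest signal on the slowest pin
swapped = assign_pins(aat, pin_delay)          # a and c exchange pins

if __name__ == "__main__":
    print(output_arrival(original, aat, pin_delay),
          output_arrival(swapped, aat, pin_delay))
```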



[Figure: pin arrival times a<0> and c<2>; the output arrival time is f<5> before the swap and f<3> after it.]

Fig. 8.19 Swapping commutative pins to reduce the arrival time at f.



More advanced techniques for pin assignment and swapping of commutative pins 
can be found in [8.9]. 

Gate decomposition. In CMOS designs, a gate with multiple inputs usually has
larger size and capacitance, as well as a more complex transistor-level network 
topology that is less efficient with respect to speed metrics such as logical effort 
[8.32]. Decomposition of multiple-input gates into smaller, more efficient gates 
can decrease delay and capacitance while retaining the same Boolean 
functionality. Fig. 8.20 illustrates the decomposition of a multiple-input gate into 
equivalent networks of two- and three-input gates. 








Fig. 8.20 Gate decomposition of a complex network into alternative networks.

Boolean restructuring. In digital circuits, Boolean logic can be implemented in
multiple ways. In the example of Fig. 8.21, f(a,b,c) = (a + b)(a + c) = a + bc
(distributive law) can be exploited to improve timing when two functions have
overlapping logic or share logic nodes. The figure shows two functions x = a + bc
and y = ab + c with arrival times AAT(a) = 4, AAT(b) = 1, and AAT(c) = 2. When
implemented using a common node a + c, the arrival times of x and y are AAT(x) =
AAT(y) = 6. However, implementing x and y separately achieves AAT(x) = 5 and
AAT(y) = 6.
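
Both claims in this example can be checked mechanically. The sketch below first verifies by exhaustive enumeration that the shared-node and separate forms of x and y are logically equivalent, and then recomputes the arrival times. The unit delay per two-input gate is an assumption; it is used here because it reproduces the arrival times quoted in the text.

```python
from itertools import product


def x_shared(a, b, c):
    return (a or b) and (a or c)


def x_separate(a, b, c):
    return a or (b and c)


def y_shared(a, b, c):
    return (a or c) and (b or c)


def y_separate(a, b, c):
    return (a and b) or c


# Exhaustive check over all eight input combinations.
equivalent = all(
    x_shared(*v) == x_separate(*v) and y_shared(*v) == y_separate(*v)
    for v in product([False, True], repeat=3)
)

GATE_DELAY = 1  # assumed delay of every two-input gate


def gate(*input_times):
    """Output arrival time of a gate: latest input plus the gate delay."""
    return max(input_times) + GATE_DELAY


aat = {"a": 4, "b": 1, "c": 2}

common = gate(aat["a"], aat["c"])                       # shared node a + c
x_sh = gate(gate(aat["a"], aat["b"]), common)           # (a + b)(a + c)
y_sh = gate(common, gate(aat["b"], aat["c"]))           # (a + c)(b + c)
x_sep = gate(aat["a"], gate(aat["b"], aat["c"]))        # a + bc
y_sep = gate(gate(aat["a"], aat["b"]), aat["c"])        # ab + c

if __name__ == "__main__":
    print(equivalent, (x_sh, y_sh), (x_sep, y_sep))
```

Sharing the node a + c places the late signal a two gate levels away from x, whereas the separate form lets a enter at the last gate of x.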

Fig. 8.21 Restructuring using logic properties, e.g., the distributive law, to improve timing.



Reverse transformations. Timing optimizations such as buffering, sizing, and
cloning increase the original area of the design. This change can cause the design to
be illegal, as some new cells can now overlap with others. To maintain legality,
either (1) perform the respective reverse operations unbuffering, downsizing, and
merging, or (2) perform placement legalization after all timing corrections.

8.6 Performance-Driven Design Flow



The previous sections have presented several algorithms and techniques to improve 
timing of digital circuit designs. This section combines all these optimizations in a 
consistent performance-driven physical design flow, which seeks to satisfy timing 
constraints, i.e., "close on timing". Due to the nature of performance optimizations, 
their ordering is important, and their interactions with conventional layout 
techniques are subject to a number of subtle limitations. Evaluation steps, 
particularly STA, must be invoked several times, and some optimizations, such as 
buffering, must be redone multiple times to facilitate a more accurate evaluation. 

Baseline physical design flow. Recall that a typical design flow starts with chip 
planning (Chap. 3), which includes I/O placement, floorplanning (Fig. 8.22), and 
power planning. Trial synthesis provides the floorplanner with an estimate of the 
total area needed by modules. Besides logic area, additional whitespace must be 
allocated to account for buffers, routability, and gate sizing. 

Fig. 8.22 A floorplan of a system-on-chip (SoC) design. Each major component is given
dimensions based on area estimates. The audio and video components are adjacent to each other,
given that their connections to other blocks and their performance constraints are similar.



Then, logic synthesis and technology mapping produce a gate- (cell)-level netlist
from a high-level specification, which is tailored to a specific technology library.
Next, global placement assigns locations to each movable object (Chap. 4). As
illustrated in Fig. 8.23, most of the cells are initially clustered in highly concentrated
regions (colored black). As the iterations progress, the cells are gradually spread
across the chip, such that they no longer overlap (colored light gray).

Fig. 8.23 The progression of cell spreading during global placement in a large, flat
(non-floorplanned) ASIC design with fixed macro blocks. Darker shades indicate greater cell
overlap while lighter shades indicate smaller cell overlap.

These locations, however, do not have to be aligned with cell rows or sites, and can
allow slight cell overlap. To ensure that the overlap is small, it is common to (1)
establish a uniform grid, (2) compute the total area of objects in each grid square,
and (3) limit this total by the area available in the square.
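
The three-step density check can be sketched as follows. As a simplifying assumption, each cell's entire area is attributed to the grid square that contains its reference point, so cells straddling a grid boundary are not split. The function name, cell data and capacity are illustrative.

```python
from collections import defaultdict


def overfull_bins(cells, bin_size, capacity):
    """Return {(col, row): total_area} for grid squares whose area exceeds capacity.

    cells is a list of (x, y, area) tuples; bin_size is the side of a grid square.
    """
    usage = defaultdict(float)
    for x, y, area in cells:
        usage[(int(x // bin_size), int(y // bin_size))] += area   # steps (1) and (2)
    return {b: a for b, a in usage.items() if a > capacity}       # step (3)


if __name__ == "__main__":
    # Hypothetical cells as (x, y, area).
    placed = [(1, 1, 30), (2, 3, 50), (3, 2, 40), (12, 1, 40)]
    print(overfull_bins(placed, bin_size=10, capacity=100))
```

A placer would use the reported squares to decide where cells must still be spread.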




Fig. 8.24 Buffered clock tree in a small CPU design. The clock source is in the lower left corner.
Crosses (×) indicate sinks, and boxes (□) indicate buffers. Each diagonal segment represents a
horizontal plus a vertical wire (L-shape), the choice of which can be based on routing congestion.

After global placement, the sequential elements are legalized. Once the locations of
sequential elements are known, a clock network (Chap. 7) is generated. ASICs, SoCs
and low-power (mobile) CPUs commonly use clock trees (Fig. 8.24), while
high-performance microprocessors incorporate structured and hand-optimized clock
distribution networks that may combine trees and meshes [8.30][8.31].

The locations of globally placed cells are first temporarily rounded to a uniform grid,
and then these rounded locations are connected during global routing (Chap. 5) and
layer assignment, where each route is assigned to a specific metal layer. The routes
indicate areas of wiring congestion (Fig. 8.25). This information is used to guide
congestion-driven detailed placement and legalization of combinational elements
(Chap. 4).








Fig. 8.25 Progression of congestion maps through iterations of global routing. The light-colored 
areas are those that do not have congestion; dark-colored peaks indicate congested regions. Initially, 
several dense clusters of wires create edges that are far over capacity. After iterations of rip-up and 
reroute, the route topologies are changed, alleviating the most congested areas. Though more 
regions can become congested, the maximum congestion is reduced. 

While detailed placement conventionally comes before global routing, the reverse 
order can reduce overall congestion and wirelength [8.28]. Note that EDA flows 
require a legal placement before global routing. In this case, legalization will be 
performed after global placement. The global routes of signal nets are then assigned 
to physical routing tracks during detailed routing (Chap. 6). 

The layout generated during the place-and-route stage is subjected to reliability, 
manufacturability and electrical verification. During mask generation, each standard 
cell and each route are represented by collections of rectangles in a format suitable 
for generating optical lithography masks for chip fabrication. 



This baseline PD flow is illustrated in Fig. 8.26 with white boxes. 

Fig. 8.26 Integrating optimizations covered in Chaps. 3-8 into a performance-driven design flow.
Some tools bundle several optimization steps, which changes the appearance of the flow to users
and often alters the user interface. Alternatives to this flow are discussed in this section.

Performance-driven physical design flow. Extending the baseline design flow,
contemporary industrial flows are typically built around static timing analysis and
seek to minimize the amount of change required to close on timing. Some flows start
timing-driven optimizations as early as the chip planning stage, while others do not
account for timing until detailed placement to ensure accuracy of timing results. This
section discusses the timing-driven flow illustrated in Fig. 8.26 with gray boxes.
Advanced methods for physical synthesis are found in [8.4].

Chip planning and logic design. Starting with a high-level design,
performance-driven chip planning generates the I/O placement of the pins and
rectangular blocks for each circuit module while accounting for block-level timing,
and the power supply network. Then, logic synthesis and technology mapping
produce a netlist based on delay budgets.

Performance-driven chip planning. Once the locations and shapes of the blocks are 
determined, global routes are generated for each top-level net, and buffers are 
inserted to better estimate timing [8.2]. Since chip planning occurs before global 
placement or global routing, there is no detailed knowledge of where the logic cells 
will be placed within each block or how they will be connected. Therefore, buffer 
insertion makes optimistic assumptions. 

After buffering, STA checks the design for timing errors. If there are a sufficient
number of violations, then the logic blocks must be re-floorplanned. In practice,
modifications to existing floorplans to meet timing are performed by experienced
designers with little to no automation. Once the design has satisfied or mostly met
timing constraints, the I/O pins can be placed, and power (VDD) and ground (GND)
supply rails can be routed around floorplan blocks.

Timing budgeting. After performance-driven floorplanning, delay budgeting sets
upper bounds on setup (long path) timing for each block. These constraints guide
logic synthesis and technology mapping to produce a performance-optimized
gate-level netlist, using standard cells from a given library.

Block-level or top-level global placement. Starting at global placement,
timing-driven optimizations can be performed at the block level, where each individual
block is optimized, or top level, where transformations are global, i.e., cross block
boundaries, and all movable objects are optimized.* Block-level approaches are
useful for designs that have many macro blocks or intellectual properties (IPs) that
have already been optimized and have specific shapes and sizes. Top-level
approaches are useful for designs that have more freedom or do not reuse
previously-designed logic; a hierarchical methodology offers more parallelism and
is more common for large design teams.

* In hierarchical design flows, different designers concurrently perform top-level placement and
block-level placement.

Buffer insertion. To better estimate and improve timing, buffers are inserted to break
any extremely long or high-fanout nets (Sec. 8.5.2). This can be done either
physically, where buffers are directly added to the placement, or virtually, where the
impact of buffering is included in delay models, but the netlist is not modified.

Physical buffering. Physical buffering [8.1] performs the full process of buffer
insertion by (1) generating obstacle-avoiding global net topologies for each net, (2)
estimating which metal layers the route uses, and (3) actually inserting buffers
(Fig. 8.27).



Fig. 8.27 Physical buffering for timing estimation. (a) A five-pin net is routed with a minimum
Steiner tree topology that does not avoid a routing obstacle (shown in gray). (b) The net routed with
an obstacle-avoiding Steiner tree topology. (c) The buffered topology offers a relatively accurate
delay estimation.

Virtual buffering [8.24], on the other hand, estimates the delay by modeling every
pin-to-pin connection as an optimally buffered line with linear delay [8.23] as

$$t_{LD}(net) = L(net) \cdot \left( R(B) \cdot C(w) + R(w) \cdot C(B) + \sqrt{2 \cdot R(B) \cdot C(B) \cdot R(w) \cdot C(w)} \right)$$

where net is the net, L(net) is the total length of net, R(B) and C(B) are the respective
intrinsic resistance and capacitance of the buffer, and R(w) and C(w) are the
respective unit wire resistance and capacitance. Though no buffers are added to the
netlist, they are assumed for timing purposes. When timing information becomes
more accurate, subsequent re-buffering steps often remove any existing buffers and
re-insert them from scratch. In this context, virtual buffering saves effort, while
preserving the accuracy of timing analysis. Physical buffering can avoid
unnecessary upsizing of drivers and is more accurate than virtual buffering, but also
more time-consuming.
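
The linear-delay estimate is cheap to evaluate. The sketch below assumes the expression t = L · (R(B)·C(w) + R(w)·C(B) + sqrt(2·R(B)·C(B)·R(w)·C(w))), with buffer parameters R(B), C(B) and per-unit-length wire parameters R(w), C(w). The function name and the numeric values are illustrative and carry no particular units or technology data.

```python
import math


def linear_buffered_delay(length, r_buf, c_buf, r_wire, c_wire):
    """Delay estimate for an optimally buffered line of the given length.

    r_buf, c_buf: intrinsic resistance and capacitance of the buffer.
    r_wire, c_wire: wire resistance and capacitance per unit length.
    """
    per_unit = (r_buf * c_wire
                + r_wire * c_buf
                + math.sqrt(2 * r_buf * c_buf * r_wire * c_wire))
    return length * per_unit


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    for length in (100, 200, 400):
        print(length, linear_buffered_delay(length, 2.0, 0.5, 0.1, 0.2))
```

Because the estimate depends on the net only through its length, it can be recomputed quickly after every placement change without touching the netlist.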



Once buffering is complete, the design is checked for timing violations using static 
timing analysis (Sec. 8.2.1). Unless timing is met, the design returns to buffering, 
global placement, or, in some cases, to logic synthesis. When timing constraints are 
mostly met, the design moves on to timing correction, which includes gate sizing 
(Sec. 8.5.1) and timing-driven netlist restructuring (Sec. 8.5.3). Subsequently, 
another timing check is performed using STA. 

Physical synthesis. After buffer insertion, physical synthesis applies several timing
correction techniques (Sec. 8.5) such as operations that modify the pin ordering or
the netlist at the gate level, to improve delay on critical paths.

Timing correction. Methods such as gate sizing increase (decrease) the size of a
physical gate to speed up (slow down) the circuit. Other techniques such as redesign
of fanin and fanout trees, cloning, and pin swapping improve timing by rebalancing
existing logic to reduce load capacitance for timing-critical nets. Transformations
such as gate decomposition and Boolean restructuring modify logic locally to
improve timing by merging or splitting logic nodes from different signals. After
physical synthesis, another timing check is performed. If it fails, another pass of
timing correction attempts to fix timing violations.

Routing. After physical synthesis, all combinational and sequential elements in the
design are connected during global and clock routing, respectively. First, the
sequential elements of the design, e.g., flip-flops and latches, are legalized (Sec. 4.4).
Then, clock network synthesis generates the clock tree or mesh to connect all
sequential elements to the clock source. Modern clock networks require a number of
large clock buffers;† performing clock-network design before detailed placement
allows these buffers to be placed appropriately. Given the clock network, the design
can be checked for hold-time (short path) constraints, since the clock skews are now
known, whereas only setup (long path) constraints could be checked before.

† These buffers are legalized immediately when added to the clock network.

Layer assignment. After clock-network synthesis, global routing assigns global route
topologies to connect the combinational elements. Then, layer assignment matches
each global route to a specific metal layer. This step improves the accuracy of delay
estimation because it allows the use of appropriate resistance-capacitance (RC)
parasitics for each net. Note that clock routing is performed before signal-net routing
when the two share the same metal layers; clock routes take precedence and should
not detour around signal nets.

Timing-driven detailed placement. The results of global routing and layer
assignment provide accurate estimates of wire congestion, which are then used by a
congestion-driven detailed placer [8.10][8.35]. The cells are (1) spread to remove
overlap among objects and decrease routing congestion, (2) snapped to standard-cell
rows and legal cell sites, and then (3) optimized by swaps, shifts and other local
changes. To incorporate timing optimizations, perform either (1) non-timing-driven
legalization followed by timing-driven detailed placement, or (2) timing-driven
legalization followed by non-timing-driven detailed placement. After
detailed placement, another timing check is performed. If timing fails, the design
could be globally re-routed or, in severe cases, globally re-placed.

To give higher priority to the clock network, the sequential elements can be
legalized first, and then followed by global and detailed routing. With this approach,
signal nets must route around the clock network. This is advantageous for
large-scale designs, as clock trees are increasingly becoming a performance
bottleneck. A variant flow, such as the industrial flow described in [8.28], first fully
legalizes the locations of all cells, and then performs detailed placement to recover
wirelength.

Another variant performs detailed placement before clock network synthesis, and
then is followed by legalization and several optimization steps.‡ After the clock
network has been synthesized, another pass of setup optimization is performed. Hold
violations may be addressed at this time or, optionally, after routing and initial STA.

‡ These include post-clock-network-synthesis optimizations, post-global-routing optimizations, and
post-detailed-routing optimizations.

Timing-driven routing. After detailed placement, clock network synthesis and
post-clock network optimization, the timing-driven routing phase aims to fix the
remaining timing violations. Algorithms discussed in Sec. 8.4 include generating
minimum-cost, minimum-radius trees for critical nets (Secs. 8.4.1-8.4.2), and
minimizing the source-to-sink delay of critical sinks (Sec. 8.4.3).

If there are still outstanding timing violations, further optimizations such as
re-buffering and late timing corrections are applied. An alternative is to have
designers manually tune or fix the design by relaxing some design constraints, using
additional logic libraries, or exploiting design structure neglected by automated tools.
After this time-consuming process, another timing check is performed. If timing is
met, then the design is sent to detailed routing, where each signal net is assigned to
specific routing tracks. Typically, incremental STA-driven Engineering Change
Orders (ECOs) are applied to fix timing violations after detailed placement; this is
followed by ECO placement and routing. Then, 2.5D or 3D parasitic extraction
determines the electromagnetic impact on timing based on the routes' shapes and
lengths, and other technology-dependent parameters.

Signoff. The last few steps of the design flow validate the layout and timing, as well
as fix any outstanding errors. If a timing check fails, ECO minimally modifies the
placement and routing such that the violation is fixed and no new errors are
introduced. Since the changes made are very local, the algorithms for ECO
placement and ECO routing differ from the traditional place and route techniques
discussed in Chaps. 4-7.

After completing timing closure, manufacturability, reliability and electrical
verification ensure that the design can be successfully fabricated and will function
correctly under various environmental conditions. The four main components are
equally important and can be performed in parallel to improve runtime.

— Design Rule Checking (DRC) ensures that the placed-and-routed layout meets
all technology-specified design rules, e.g., minimum wire spacing and width.

— Layout vs. Schematic (LVS) checking ensures the placed-and-routed layout
matches the original netlist.

— Antenna Checks seek to detect undesirable antenna effects, which may damage
a transistor during plasma-etching steps of manufacturing by collecting excess
charge on metal wires that are connected to PN-junction nodes. This can occur
when a route consists of multiple metal layers and a charge is induced on a
metal layer during fabrication.

— Electric Rule Checking (ERC) finds all potentially dangerous electric
connections, such as floating inputs and shorted outputs.

Once the design has been physically verified, optical-lithography masks are 
generated for manufacturing. 

This chapter explained how to combine timing optimizations into a comprehensive 
physical design flow. In practice, the flow described in Sec. 8.6 (Fig. 8.26) can be 
modified based on several factors, including 

— Design type. 

  - ASIC, microprocessor, IP, analog, mixed-mode. 

  - Datapath-heavy specifications may require specialized tools for structured 
    placement or manual placement. Datapaths typically have shorter wires 
    and require fewer buffers for high-performance layout. 

— Design objectives. 

  - High-performance, low-power or low-cost. 

  - Some high-performance optimizations, such as buffering and gate 
    sizing, increase circuit area, thus increasing circuit power and chip cost. 

— Additional optimizations. 

  - Retiming shifts locations of registers among combinational gates to better 
    balance delay. 

  - Useful skew scheduling, where the clock signal arrives earlier at some 
    flip-flops and later at others, to adjust timing. 

  - Adaptive body-biasing can improve the leakage current of transistors. 

— Additional analyses. 

  - Multi-corner and multi-mode static timing analysis, as industrial ASICs 
    and microprocessors are often optimized to operate under different 
    temperatures and supply voltages. 

  - Thermal analysis is required for high-performance CPUs. 

— Technology node, typically specified by the minimum feature size. 

  - Nodes < 180 nm require timing-driven placement and routing flows, as 
    lumped-capacitance models are inadequate for performance estimation. 

  - Nodes < 130 nm require timing analysis with signal integrity, i.e., 
    interconnect coupling capacitances and the resulting delay increase 
    (decrease) of a given victim net when a neighboring aggressor net 
    switches simultaneously in the opposite (same) direction. 

  - Nodes < 90 nm require additional resolution enhancement techniques 
    (RET) for lithography. 

  - Nodes < 65 nm require power-integrity (e.g., IR drop-aware timing, 
    electromigration reliability) analysis flows. 

  - Nodes < 45 nm require additional statistical power-performance tradeoffs 
    at the transistor level. 

  - Nodes < 32 nm impose significant limitations on detailed routing, known 
    as restricted design rules (RDRs), to ensure manufacturability. 

— Available tools. 

  - In-house software, commercial EDA tools [8.34]. 

— Design size and the extent of design reuse. 

  - Larger designs often include more global interconnect, which may 
    become a performance bottleneck and typically requires buffering. 

  - IP blocks are typically represented by hard blocks during floorplanning. 

— Design team size, required time-to-market, available computing resources. 

  - To shorten time-to-market, one can leverage a large design team by 
    partitioning the design into blocks and assigning blocks to teams. 

  - After floorplanning, each block can be laid out in parallel; however, flat 
    optimization (no partitioning) sometimes produces better results. 

Reconfigurable fabrics such as FPGAs require less attention to buffering, due to 
already-buffered programmable interconnect. Wire congestion is often negligible for 
FPGAs because interconnect resources are overprovisioned. However, FPGA 
detailed placement must satisfy a greater number of constraints than placement for 
other circuit types, and global routing must select from a greater variety of 
interconnect types. Electrical and manufacturability checks are unnecessary for 
FPGAs, but technology mapping is more challenging, as it affects the area and 
timing to a greater extent, and can benefit more from the use of physical information. 
Therefore, modern physical-synthesis flows for FPGAs perform global placement, 
often in a trial mode, between logic synthesis and technology mapping. 

Physical design flows will require additional sophistication to support increasing 
transistor densities in semiconductor chips. The advent of future technology nodes - 
28 nm, 22 nm and 16 nm - will bring into consideration new electrical and 
manufacturing-related phenomena, while increasing uncertainty in device 
parameters [8.22]. Further increase in transistor counts may require integrating 
multiple chips into three-dimensional integrated circuits, thus changing the geometry 
of fundamental physical design optimizations [8.36]. Nevertheless, the core 
optimizations described in this chapter will remain vital in chip design. 

Chapter 8 Exercises 

Exercise 1: Static Timing Analysis 

Given the logic circuit below, draw the timing graph (a), and determine the (b) AAT, 
(c) RAT, and (d) slack of each node. The AATs of the inputs are in angular brackets, 
the delays are in parentheses, and the RAT of the output is in square brackets. 



[Figure: logic circuit for Exercise 1 with inputs a <0>, b <0.15> and c <0.3>; the gate and wire delays of the original drawing are not recoverable here.] 




Exercise 2: Timing-Driven Routing 

Given the terminal locations of a signal net, assume that all distances are Manhattan, 
and that no Steiner point is used when routing. Construct the spanning tree T, and 
calculate radius(T) and cost(T), for each of the following. 

(a) Prim-Dijkstra tradeoff with γ = 0 (Prim's MST algorithm). 

(b) Prim-Dijkstra tradeoff with γ = 1 (Dijkstra's algorithm). 

(c) Prim-Dijkstra tradeoff with γ = 0.5. 



[Figure: terminal locations of the signal net, including the source s0.] 



Exercise 3: Buffer Insertion for Timing Improvement 

For the logic circuit and load capacitances on the following page, assume that 
available gate sizes and timing performances are similar to those in Fig. 8.12. 
Assume that gate delay always increases linearly with load capacitance. Let the 
input capacitance of buffer y be 0.5 fF, 1 fF, and 2 fF with sizes A, B and C, 
respectively. Determine the size for buffer y that minimizes the AAT of sink c. 

[Figure: logic circuit with inputs a and b and sinks c, d and e, with load capacitances C(c) = 2.5, C(d) = 1.5 and C(e) = 0.5.] 



Exercise 4: Timing Optimization 

List at least two timing optimizations covered only in this chapter (not mentioned 
beforehand). Describe these optimizations in your own words and discuss scenarios 
in which (1) they can be useful and (2) they can be harmful. 

Exercise 5: Cloning vs. Buffering 

List and explain scenarios where cloning results in better timing improvements than 
buffering, and vice-versa. Explain why both methods are necessary for 
timing-driven physical synthesis. 

Exercise 6: Physical Synthesis 

In terms of timing corrections such as buffering, gate sizing, and cloning, when are 
their reverse transformations useful? In what situations will a given timing 
correction cause the design to be illegal? Explain for each timing correction. 

A Solutions to Chapter Exercises 



Chapter 2: Netlist and System Partitioning 



Exercise 1: KL Algorithm 

Maximum gains and the swaps of nodes with positive gain per iteration i are given. 

i = 1: Δg_1 = D(a) + D(f) - 2c(a,f) = 2 + 1 - 0 = 3 → swap nodes a and f, G_1 = 3. 
i = 2: Δg_2 = D(c) + D(e) - 2c(c,e) = -2 + 0 - 0 = -2 → swap nodes c and e, G_2 = 1. 
i = 3: Δg_3 = D(b) + D(d) - 2c(b,d) = -1 + 0 - 0 = -1 → swap nodes b and d, G_3 = 0. 

Maximum gain is G_m = 3 with m = 1. 
Therefore, swap nodes a and f. 
Graph after Pass 1 (right). 
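
The choice of m above is a maximum-prefix-sum selection over the per-iteration gains. A minimal sketch of that step follows; the function name `best_prefix` is illustrative and not from the text, and ties are broken toward the shortest prefix as an assumption.

```python
from itertools import accumulate

def best_prefix(gains):
    """Return (m, G_m, prefix_sums) for a list of per-iteration gains.

    m is the 1-based length of the prefix with the largest cumulative
    gain; the first maximum is taken when several prefixes tie.
    """
    prefix_sums = list(accumulate(gains))
    g_max = max(prefix_sums)
    m = prefix_sums.index(g_max) + 1
    return m, g_max, prefix_sums

# Gains of the KL pass above: 3, -2, -1
m, g_max, prefix_sums = best_prefix([3, -2, -1])
print(prefix_sums, m, g_max)
```

Only the first m swaps are kept; the remaining swaps of the pass are undone.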




Exercise 2: Critical Nets and Gain During the FM Algorithm 

(a) Table recording (1) the cell moved, (2) critical nets before the move, (3) critical 
nets after the move, and (4) which cells require a gain update. 



Cell     Critical nets      Critical nets     Which cells require 
moved    before the move    after the move    a gain update 
a        -                  N1                b 
b        -                  N1                a 
c        -                  N1                d 
d        N3                 N1, N3            c, h, i 
e        N2                 N2                f, g 
f        N2                 N2                e, g 
g        N2                 N2                e, f 
h        N3                 N3                d, i 
i        N3                 N3                d, h 



(b) Gains for each cell. 



Δg_1(a) = 0     Δg_1(b) = 0     Δg_1(c) = 0 

Δg_1(d) = 0     Δg_1(e) = -1    Δg_1(f) = -1 

Δg_1(g) = -1    Δg_1(h) = 1     Δg_1(i) = 0 



A. B. Kahng et al., VLSI Physical Design: From Graph Partitioning to Timing Closure, 
DOI 10.1007/978-90-481-9591-6, © Springer Science+Business Media B.V. 2011 

Exercise 3: FM Algorithm 

Pass 2, iteration i = 1 

Gain values: Δg_1(a) = -1, Δg_1(b) = -1, Δg_1(c) = 0, Δg_1(d) = 0, Δg_1(e) = -1. 

Cells c and d have maximum gain value Δg_1 = 0. 

Balance criterion after moving cell c: area(A) = 4. 

Balance criterion after moving cell d: area(A) = 1. 

Cell c meets the balance criterion better. 

Move cell c, updated partitions: A_1 = {d}, B_1 = {a,b,c,e}, with fixed cells {c}. 

Pass 2, iteration i = 2 

Gain values: Δg_2(a) = -2, Δg_2(b) = -2, Δg_2(d) = 2, Δg_2(e) = -1. 

Cell d has maximum gain Δg_2 = 2, area(A) = 0, balance criterion is violated. 

Cell e has next maximum gain Δg_2 = -1, area(A) = 9, balance criterion is met. 

Move cell e, updated partitions: A_2 = {d,e}, B_2 = {a,b,c}, with fixed cells {c,e}. 

Pass 2, iteration i = 3 

Gain values: Δg_3(a) = 0, Δg_3(b) = -2, Δg_3(d) = 2. 

Cell d has maximum gain Δg_3 = 2, area(A) = 5, balance criterion is met. 

Move cell d, updated partitions: A_3 = {e}, B_3 = {a,b,c,d}, with fixed cells {c,d,e}. 

Pass 2, iteration i = 4 

Gain values: Δg_4(a) = -2, Δg_4(b) = -2. 

Cells a and b have maximum gain Δg_4 = -2. 

Balance criterion after moving cell a: area(A) = 7. 

Balance criterion after moving cell b: area(A) = 9. 

Cell a meets the balance criterion better. 

Move cell a, updated partitions: A_4 = {a,e}, B_4 = {b,c,d}, with fixed cells {a,c,d,e}. 

Pass 2, iteration i = 5 

Gain values: Δg_5(b) = 1. 

Balance criterion after moving b: area(A) = 11. 

Move cell b, updated partitions: A_5 = {a,b,e}, B_5 = {c,d}, with fixed cells {a,b,c,d,e}. 

Find best move sequence <c_1 ... c_m> 

G_1 = Δg_1 = 0 
G_2 = Δg_1 + Δg_2 = -1 
G_3 = Δg_1 + Δg_2 + Δg_3 = 1 
G_4 = Δg_1 + Δg_2 + Δg_3 + Δg_4 = -1 
G_5 = Δg_1 + Δg_2 + Δg_3 + Δg_4 + Δg_5 = 0 

Maximum positive gain G_m = 1 occurs when m = 3. 
Cells c, e and d are moved. 

The result after Pass 2 is illustrated on the right. 

[Figure: partitions after Pass 2.] 



Exercise 4: System and Netlist Partitioning 

One key difference is that traditional min-cut partitioning only accounts for 
minimizing net costs across k partitions. FPGA-based partitioning involves first 
determining the number of devices and then minimizing the total communication 
between devices as well as the device logic. For instance, traditional min-cut 
partitioning does not distinguish how many devices a p-pin net is split across. 
Furthermore, min-cut partitioning does not account for FPGA reconfigurability. 

Exercise 5: Multilevel FM Partitioning 

One major advantage is scalability. Traditional FM partitioning scales to ~200 nodes 
whereas multilevel FM can efficiently handle large-scale modern designs. The 
coarsening stage clusters nodes together, thereby reducing the number of nodes that 
FM interacts with. FM produces near-optimal solutions for netlists with fewer than 
200 nodes, but solution quality deteriorates for larger netlists. In contrast, multilevel 
FM produces great solution quality without sacrificing large amounts of runtime. 



Exercise 6: Clustering 

Nodes that have either single connections to multiple other nodes or multiple 
connections to a single node are candidates for clustering. If a net net is contained 
within a single partition, then net does not contribute to the cut cost of the partition. 
If net spans multiple partitions, then one option is to place net's cluster in the 
partition where net's net weight is the greatest. Another option is to limit the size of the 
clusters such that the individual nodes of net are clustered within each partition. 



Chapter 3: Chip Planning 

Exercise 1: Slicing Trees and Constraint Graphs 

[Figure: slicing tree.] 

[Figure: vertical constraint graph (VCG).] 

[Figure: horizontal constraint graph (HCG).] 

Exercise 2: Floorplan-Sizing Algorithm 

(a) Shape functions for blocks a (left), b (center) and c (right). 

[Figure: shape functions h_a(w), h_b(w) and h_c(w).] 

(b) Shape function of the floorplan. 

Horizontal composition: Determine h_(a,b)(w) of blocks a and b. 

[Figure: shape function h_(a,b)(w).] 

Vertical composition: Determine h_((a,b),c)(w) and the minimum-area corner points. 
The minimum area of ((a,b),c) is 16, with dimensions either h_((a,b),c) = 8 and 
w_((a,b),c) = 2, or h_((a,b),c) = 4 and w_((a,b),c) = 4. 

[Figure: shape function h_((a,b),c)(w).] 

Construct floorplans by backtracing the shape functions h_c(w), h_(a,b)(w), h_a(w) and 
h_b(w) to determine the dimensions of each block. The following shows the backtrace 
for the corner point (2,8) from h_((a,b),c)(w). 

[Figure: backtrace for the corner point (2,8).] 

The following shows the backtrace for the corner point (4,4) from h_((a,b),c)(w). 

[Figure: backtrace for the corner point (4,4).] 



Exercise 3: Linear Ordering Algorithm 



Iteration  Block  New Nets        Terminating Nets  gain  Continuing Nets 
0          a      N1, N2, N3, N4 
1          b      N5, N6          N2                -1    - 
           c      N5              -                 -1    N3 
           d      N5, N6          N4                -1    N3 
           e      -               N1                +1    - 
2          b      N5, N6          N2                -1    - 
           c      N5              -                 -1    N3 
           d      N5, N6          N4                -1    N3 
3          b      -               N2, N6            +2    N5 
           c      -               N3                +1    N5 
4          c      -               N3, N5            +2    - 



For each iteration, bold font denotes the block with the maximum gain. 

Iteration 0: set block a as the first block in the ordering. 

Iteration 1: block e has maximum gain. Set as the second block in the ordering. 

Iteration 2: blocks b, c and d all have maximum gain of -1. Blocks b and d each have 
one terminating net. Block d has a higher number of continuing nets. Set block d as 
the third block in the ordering. 

Iteration 3: block b has maximum gain. Set as the fourth block in the ordering. 

Iteration 4: set block c as the fifth (last) block in the ordering. 

The linear ordering that heuristically minimizes total net cost is <a e d b c>. 







[Figure: linear ordering <a e d b c> with nets N1-N6.] 









Exercise 4: Non-Slicing Floorplans 

There are many possible non-slicing floorplans using four blocks a-d. The following 
is one such solution. 



[Figure: one non-slicing floorplan of blocks a-d.] 

Chapter 4: Global and Detailed Placement 



Exercise 1: Estimating Total Wirelength 

(a) Tree representations of the five-pin net. 

[Figure: minimum-length chain, minimum spanning tree and Steiner minimum tree of the five-pin net.] 

(b) L(net) = Σ_e w(e) · d_M(e) 

L_chain(net) = w(d,e) · d_M(d,e) + w(d,c) · d_M(d,c) + w(c,b) · d_M(c,b) + w(b,a) · d_M(b,a) 
             = 2 · 3 + 2 · 4 + 2 · 2 + 2 · 2 = 22 

L_MST(net) = w(d,e) · d_M(d,e) + w(e,b) · d_M(e,b) + w(c,b) · d_M(c,b) + w(b,a) · d_M(b,a) 
           = 2 · 3 + 2 · 3 + 2 · 2 + 2 · 2 = 20 

L_SMT(net) = w(d,e,b) · d_M(d,e,b) + w(c,b) · d_M(c,b) + w(b,a) · d_M(b,a) 
           = 2 · 5 + 2 · 2 + 2 · 2 = 18 
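
The three totals above are weighted sums of Manhattan segment lengths. The short sketch below reproduces them; the (weight, length) pairs are taken from the solution, while the helper name `weighted_length` is only illustrative.

```python
def weighted_length(segments):
    """Sum w(e) * d_M(e) over (weight, Manhattan length) pairs."""
    return sum(weight * length for weight, length in segments)

# (weight, length) pairs as listed in the solution above
chain = [(2, 3), (2, 4), (2, 2), (2, 2)]   # minimum-length chain
mst = [(2, 3), (2, 3), (2, 2), (2, 2)]     # minimum spanning tree
smt = [(2, 5), (2, 2), (2, 2)]             # Steiner minimum tree

print(weighted_length(chain), weighted_length(mst), weighted_length(smt))
```

As expected, the chain is the longest estimate and the Steiner minimum tree the shortest.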

Exercise 2: Min-Cut Placement 

After the initial partitioning (vertical cut cut_1): 

Cells left of cut_1: L = {c,d,f,g}; cells right of cut_1: R = {a,b,e}. Cut cost: L-R = 1. 

Possible solutions after the second partitioning (two horizontal cuts cut_2L and cut_2R): 
Top Left: TL = {c,g}. Bottom Left: BL = {d,f}. Cut cost: TL-BL = 2. 
Top Right: TR = {a}. Bottom Right: BR = {b,e}. Cut cost: TR-BR = 1. 

After the third partitioning (four vertical cuts cut_3TL, cut_3BL, cut_3TR and cut_3BR), one 
possible result is 

[Figure: placement after the third partitioning, with cuts cut_3TL, cut_3BL, cut_3TR and cut_3BR.] 

The maximum number of nets that cross an edge is η_P(e) = 2, and the edge capacity is 
σ(e) = 2. Therefore, since Φ(P) = 1, the design is estimated to be fully routable. 

Exercise 3: Force-Directed Placement 

Solve for x_a^0 and x_b^0 (all sums run over the cells j with c(·,j) ≠ 0): 

x_a^0 = Σ_j c(a,j) · x_j^0 / Σ_j c(a,j) 
      = (c(a,In1) · x_In1 + c(a,In2) · x_In2 + c(a,b) · x_b^0) / (c(a,In1) + c(a,In2) + c(a,b)) 
      = 0.5 x_b^0,   with c(a,In1) + c(a,In2) + c(a,b) = 2 + 2 + 4 = 8 

x_b^0 = Σ_j c(b,j) · x_j^0 / Σ_j c(b,j) 
      = (c(b,a) · x_a^0 + c(b,In2) · x_In2 + c(b,Out) · x_Out) / (c(b,a) + c(b,In2) + c(b,Out)) 
      = 0.5 + 0.5 x_a^0,   with c(b,a) + c(b,In2) + c(b,Out) = 4 + 2 + 2 = 8 

x_a^0 = 0.5 x_b^0 = 0.5 (2/3) = 1/3 

Rounded, x_a^0 ≈ 0 and x_b^0 ≈ 1. 

Solve for y_a^0 and y_b^0: 

y_a^0 = Σ_j c(a,j) · y_j^0 / Σ_j c(a,j) 
      = (c(a,In1) · y_In1 + c(a,In2) · y_In2 + c(a,b) · y_b^0) / (c(a,In1) + c(a,In2) + c(a,b)) 
      = (2 · 2 + 2 · 0 + 4 · y_b^0) / (2 + 2 + 4) = 0.5 + 0.5 y_b^0 

y_b^0 = Σ_j c(b,j) · y_j^0 / Σ_j c(b,j) 
      = (c(b,a) · y_a^0 + c(b,In2) · y_In2 + c(b,Out) · y_Out) / (c(b,a) + c(b,In2) + c(b,Out)) 
      = 0.25 + 0.5 y_a^0,   with c(b,a) + c(b,In2) + c(b,Out) = 4 + 2 + 2 = 8 

y_a^0 = 0.5 + 0.5 y_b^0 = 0.5 + 0.5 (2/3) = 5/6 

Rounded, y_a^0 ≈ 1 and y_b^0 ≈ 1. 

The ZFT-positions for gates a and b are (0,1) and 
(1,1), respectively. 

[Figure: placement grid with In1, In2, Out and the ZFT-positions of gates a and b.] 
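
The two coupled ZFT equations per coordinate can be solved exactly by substitution. The sketch below does this with rational arithmetic. The connection weights come from the solution; the fixed pin coordinates In1 = (0,2), In2 = (0,0) and Out = (2,1) are assumptions chosen so that the equations reduce to the ones in the solution, and `zft_pair` is a hypothetical helper name.

```python
from fractions import Fraction

def zft_pair(fixed_a, fixed_b, c_ab):
    """Solve the coupled ZFT equations of two connected movable cells a, b.

    fixed_a / fixed_b: (weight, coordinate) pairs of the fixed neighbours
    of a and of b; c_ab: connection weight between a and b.
    Returns the exact zero-force coordinates (a, b) for one axis.
    """
    total_a = sum(w for w, _ in fixed_a) + c_ab
    total_b = sum(w for w, _ in fixed_b) + c_ab
    const_a = Fraction(sum(w * p for w, p in fixed_a), total_a)
    const_b = Fraction(sum(w * p for w, p in fixed_b), total_b)
    slope_a = Fraction(c_ab, total_a)   # a = const_a + slope_a * b
    slope_b = Fraction(c_ab, total_b)   # b = const_b + slope_b * a
    a = (const_a + slope_a * const_b) / (1 - slope_a * slope_b)
    b = const_b + slope_b * a
    return a, b

# Assumed pin positions: In1 = (0, 2), In2 = (0, 0), Out = (2, 1)
x_a, x_b = zft_pair([(2, 0), (2, 0)], [(2, 0), (2, 2)], c_ab=4)
y_a, y_b = zft_pair([(2, 2), (2, 0)], [(2, 0), (2, 1)], c_ab=4)
print((x_a, y_a), (x_b, y_b))
print((round(x_a), round(y_a)), (round(x_b), round(y_b)))
```

Rounding the exact positions (1/3, 5/6) and (2/3, 2/3) to the grid gives the ZFT-positions (0,1) and (1,1) stated in the solution.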



Exercise 4: Global and Detailed Placement 

Global placement assigns cells or moveable blocks to bins of a coarse grid. In 
contrast, detailed placement moves cells within each grid bin such that cells do not 
mutually overlap (legalization). Typically, placement is split into two steps to ensure 
scalability. Legalization requires a large amount of runtime and cannot be applied 
after every iteration of global placement. A much faster method is to assign blocks 
to grid bins first and, when all blocks have been assigned, legalize afterward. 
Furthermore, detailed placement can be easily performed in parallel, as each grid bin 
can be processed independently. 



Chapter 5: Global Routing 



Exercise 1: Steiner Tree Routing 

(a) All Hanan points and the minimum bounding box for the six-pin net. 



[Figure: Hanan points (30) and minimum bounding box (MBB) of the six-pin net p1-p6.] 

(b) Sequential Steiner tree heuristic. 

[Figure: step-by-step construction of the sequential Steiner tree over pins p1-p6.] 



(c) Three Steiner points each with degree three. 

(d) A three-pin net can have a maximum of one Steiner point. 
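
For part (a), the Hanan points of a net are the intersections of the horizontal and vertical lines through its pins. The sketch below counts them for a hypothetical six-pin net; the pin coordinates are made up for illustration (the exercise's actual coordinates are not reproduced here), and the pins themselves are excluded from the count, which is an assumption about how the figure counts them.

```python
def hanan_points(pins):
    """Grid intersections induced by the pins, excluding the pins themselves."""
    xs = {x for x, _ in pins}
    ys = {y for _, y in pins}
    return {(x, y) for x in xs for y in ys} - set(pins)

def bounding_box(pins):
    """Minimum bounding box as ((x_min, y_min), (x_max, y_max))."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (min(xs), min(ys)), (max(xs), max(ys))

# Hypothetical six-pin net with pairwise distinct x- and y-coordinates
pins = [(0, 4), (1, 1), (2, 5), (3, 2), (4, 0), (5, 3)]
print(len(hanan_points(pins)), bounding_box(pins))
```

With six pins on distinct rows and columns, the 6 x 6 grid has 36 intersections, 30 of which are not pins.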

Exercise 2: Global Routing in a Connectivity Graph 

After routing net A. 

[Figure: connectivity graph after routing net A.] 

After routing net B. 

[Figure: connectivity graph after routing net B.] 

The given placement is routable. 

Exercise 3: Dijkstra's Algorithm 

[Figure: routing graph and the Group 2 and Group 3 tables of Dijkstra's algorithm.] 

Group 3 entries: 

<a> [b] (2,2) 
<a> [d] (3,2) 
<d> [e] (4,3) 
<e> [h] (6,4) 
<e> [f] (5,6) 
<f> [c] (6,7) 
<d> [g] (8,5) 
<h> [i] (8,6) 



Exercise 4: ILP-Based Global Routing 

For net A, the possible routes are two L-shapes (A1, A2). 

Net Constraints:      x_A1 + x_A2 ≤ 1 
Variable Constraints: 0 ≤ x_A1 ≤ 1, 0 ≤ x_A2 ≤ 1 

For net B, the possible routes are two L-shapes (B1, B2). 

Net Constraints:      x_B1 + x_B2 ≤ 1 
Variable Constraints: 0 ≤ x_B1 ≤ 1, 0 ≤ x_B2 ≤ 1 

For net C, the possible routes are two L-shapes (C1, C2). 

Net Constraints:      x_C1 + x_C2 ≤ 1 
Variable Constraints: 0 ≤ x_C1 ≤ 1, 0 ≤ x_C2 ≤ 1 

[Figure: candidate L-shaped routes of nets A, B and C on the routing grid.] 

Horizontal Edge Capacity Constraints: 

G(0,0)~G(1,0):  x_C1 ≤ σ(G(0,0)~G(1,0)) = 1 
G(1,0)~G(2,0):  x_C1 ≤ σ(G(1,0)~G(2,0)) = 1 
G(2,0)~G(3,0):  x_B1 ≤ σ(G(2,0)~G(3,0)) = 1 
G(3,0)~G(4,0):  x_B1 ≤ σ(G(3,0)~G(4,0)) = 1 
G(0,1)~G(1,1):  x_A2 ≤ σ(G(0,1)~G(1,1)) = 1 
G(1,1)~G(2,1):  x_A2 ≤ σ(G(1,1)~G(2,1)) = 1 
G(2,1)~G(3,1):  x_B2 ≤ σ(G(2,1)~G(3,1)) = 1 
G(3,1)~G(4,1):  x_B2 ≤ σ(G(3,1)~G(4,1)) = 1 
G(0,2)~G(1,2):  x_C2 ≤ σ(G(0,2)~G(1,2)) = 1 
G(1,2)~G(2,2):  x_C2 ≤ σ(G(1,2)~G(2,2)) = 1 
G(0,3)~G(1,3):  x_A1 ≤ σ(G(0,3)~G(1,3)) = 1 
G(1,3)~G(2,3):  x_A1 ≤ σ(G(1,3)~G(2,3)) = 1 

Vertical Edge Capacity Constraints: 

G(0,0)~G(0,1):  x_C2 ≤ σ(G(0,0)~G(0,1)) = 1 
G(2,0)~G(2,1):  x_B2 + x_C1 ≤ σ(G(2,0)~G(2,1)) = 1 
G(4,0)~G(4,1):  x_B1 ≤ σ(G(4,0)~G(4,1)) = 1 
G(0,1)~G(0,2):  x_A2 + x_C2 ≤ σ(G(0,1)~G(0,2)) = 1 
G(2,1)~G(2,2):  x_A1 + x_C1 ≤ σ(G(2,1)~G(2,2)) = 1 
G(0,2)~G(0,3):  x_A2 ≤ σ(G(0,2)~G(0,3)) = 1 
G(2,2)~G(2,3):  x_A1 ≤ σ(G(2,2)~G(2,3)) = 1 

Objective Function: Maximize x_A1 + x_A2 + x_B1 + x_B2 + x_C1 + x_C2 

The solution is routable with routes A2, B1 and C1 (right). 

[Figure: routed nets A, B and C using routes A2, B1 and C1.] 

Exercise 5: Shortest Path with A* Search 

Removing the top-right obstacle yields the following A* search progression. 

[Figure: A* search progression.] 



Exercise 6: Rip-Up and Reroute 

Estimated memory usage is O(m² · n). The number of net segments per net is O(m²). 
Since the number of nets is n, the tight upper bound of memory usage is O(m² · n). 



Chapter 6: Detailed Routing 



Exercise 1: Left-Edge Algorithm 

(a) Sets S(col): 
S(a) = {A,B} 
S(b) = {A,B,C} 
S(c) = {A,C,D} 
S(d) = {A,C,D} 
S(e) = {C,D,E} 
S(f) = {D,E,F} 
S(g) = {E,F} 
S(h) = {F} 

Maximal S(col): 

S(b) = {A,B,C} 

S(c) = {A,C,D} or S(d) = {A,C,D} 

S(e) = {C,D,E} 

S(f) = {D,E,F} 

Minimum number of required tracks = |S(b)| = |S(c)| = |S(d)| = |S(e)| = |S(f)| = 3 
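
The sets S(col) follow mechanically from the horizontal span of each net. The sketch below derives them from the two pin rows of the channel; the pin strings are read from the routed-channel figure (with 0 marking an empty pin position), and the function name `column_sets` is illustrative.

```python
def column_sets(top, bottom):
    """Return, for each column, the set of nets whose span covers it."""
    span = {}
    for col, nets in enumerate(zip(top, bottom)):
        for net in nets:
            if net != "0":                      # "0" marks an unconnected pin
                lo, hi = span.get(net, (col, col))
                span[net] = (min(lo, col), max(hi, col))
    return [
        {net for net, (lo, hi) in span.items() if lo <= col <= hi}
        for col in range(len(top))
    ]

top, bottom = "ABA0ED0F", "BCDACFE0"            # columns a-h
sets = column_sets(top, bottom)
density = max(len(s) for s in sets)             # lower bound on the track count
for name, s in zip("abcdefgh", sets):
    print(f"S({name}) = {sorted(s)}")
print("minimum number of tracks:", density)
```

The largest set size is the channel density, which is the minimum number of tracks when the vertical constraints permit it.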
(b) Horizontal constraint graph (HCG) and vertical constraint graph (VCG). 

[Figure: horizontal constraint graph (HCG) and vertical constraint graph (VCG).] 

(c) Routed channel. 

[Figure: routed channel with top pins A B A 0 E D 0 F, bottom pins B C D A C F E 0, and tracks curr_track = 1, 2, 3.] 

Exercise 2: Dogleg Left-Edge Algorithm 

(a) Vertical constraint graph (VCG) without net splitting. 

[Figure: VCG without net splitting.] 

(b) After splitting nets A, C and D: {A1,A2,A3,B,C1,C2,D1,D2,E}. 

Columns:     a b c d e f g h 
Top pins:    A A B 0 A D C E 
Bottom pins: 0 B C A C E D D 

[Figure: channel with the split nets A1-A3, C1-C2 and D1-D2.] 

Sets S(col): 
S(a) = {A1} 
S(b) = {A1,A2,B} 
S(c) = {A2,B,C1} 
S(d) = {A2,A3,C1} 
S(e) = {A3,C1,C2} 
S(f) = {C2,D1,E} 
S(g) = {C2,D1,D2,E} 
S(h) = {D2,E} 

Maximal S(col): 
S(b) = {A1,A2,B} 
S(c) = {A2,B,C1} 
S(d) = {A2,A3,C1} 
S(e) = {A3,C1,C2} 
S(g) = {C2,D1,D2,E} 

(c) Vertical constraint graph (VCG) after net splitting. 



(d) Without net splitting, the instance is not routable because of the cycle D-E-D in 
the VCG from (a). With net splitting, the instance is routable. The minimum 
number of tracks needed = |S(c)| = |S(g)| = 3. 

(e) Track assignment: 

curr_track = 1 

Consider nets A1, A2 and A3. Assign A1 first because it is the leftmost in the zone 
representation. Nets A2 and A3 do not cause a conflict. Therefore, assign A2 and A3 to 
curr_track. Remove nets A1, A2 and A3 from the VCG. 

curr_track = 2 

Consider nets B and C2. Assign B first because it is the leftmost in the zone 
representation. Net C2 does not cause a conflict. Therefore, assign C2 to curr_track. Remove 
nets B and C2 from the VCG. 

curr_track = 3 

Consider nets C1 and D1. Assign C1 first because it is the leftmost in the zone 
representation. Net D1 does not cause a conflict. Therefore, assign D1 to curr_track. 
Remove nets C1 and D1 from the VCG. 

curr_track = 4 and curr_track = 5 

Nets E and D2 are assigned to curr_track = 4 and curr_track = 5. 

The channel with routed nets is illustrated below. 

[Figure: routed channel with top pins A A B 0 A D C E, bottom pins 0 B C A C E D D, and tracks curr_track = 1 to 5.] 



Exercise 3: Switchbox Routing 

The switchbox has six columns a-f from left to right, and six tracks 1-6 from bottom 
to top. The following are the steps carried out at each column. 

Column a: Assign net F to track 1. Assign net B to track 6. Extend nets F (track 1), 
G (track 2), A (track 3) and B (track 6). 

Column b: Connect the top pin A and bottom pin A to net A on track 3. Extend nets 
F (track 1), G (track 2) and B (track 6). 

Column c: Connect the bottom pin F to net F on track 1. Connect the top pin C to 
track 3. Extend nets G (track 2), C (track 3) and B (track 6). 

Column d: Connect the bottom pin G to net G on track 2. Connect the top pin E to 
track 4. Extend nets G (track 2), C (track 3), E (track 4) and B (track 6). 

Column e: Connect the bottom pin D with track 1. Connect the top pin B to net B on 
track 6. Assign net G to track 5. Extend nets D (track 1), C (track 3), E (track 4) and 
G (track 5). 

Column f: Connect the top pin D to net D on track 1. Extend the nets on tracks 2, 3, 
4 and 5 to their corresponding pins. 

Top pins:    b = A, c = C, d = E, e = B, f = D 
Bottom pins: b = A, c = F, d = G, e = D 

Exercise 4: Manufacturing Defects 

In a congested region, connections are more likely to detour, which increases the usage 
of vias. Therefore, vias are more likely to occur. Likewise, since wires are packed more 
closely, shorts are more likely in congested regions. Opens are also more likely because 
detoured connections are longer. Antenna effects are, a priori, less likely in congested 
regions because fewer connections use long straightline segments on low metal layers. 
However, routing congestion makes it more difficult to fix antenna violations. 

Exercise 5: Modern Challenges in Detailed Routing 

See reference [6.17] - G. Xu, L.-D. Huang, D. Pan and M. Wong, "Redundant-Via 
Enhanced Maze Routing for Yield Improvement", Proc. Asia and South Pacific 
Design Autom. Conf., 2005, pp. 1148-1151. 



[Figure: routed switchbox for Exercise 3, with tracks 1-6.] 











Exercise 6: Non-Tree Routing 

One advantage of non-tree routing is that the redundant wires mitigate the impact of 
opens. However, redundant routing increases the total wirelength of the design. 
Excessive wirelength, especially in congested regions, pushes wires closer to each 
other, increasing the incidence of shorts. More information on non-tree routing can 
be found in [6.12] - A. Kahng, B. Liu and I. Mandoiu, "Non-Tree Routing for 
Reliability and Yield Improvement", Proc. Intl. Conf. on CAD, 2002, pp. 260-266. 



Chapter 7: Specialized Routing 



Exercise 1: Net Ordering 

Without history costs, there is no indication how much demand a given edge or track 
has over time. In contrast, congestion only shows how often a track is used in the 
recent past, e.g., the previous or current iteration. That is, without any previous 
knowledge of rip-up and reroute iterations, a router has very limited knowledge 
about which tracks are frequently used. Therefore, net ordering is the primary 
method by which to alleviate congestion. 

Exercise 2: Octilinear Maze Search 

Octilinear maze search is similar to BFS and Dijkstra's algorithm in the sense that it 
can find the shortest path between two points. However, octilinear maze search and 
BFS only run on graphs with equal edge weights, such as a grid, while Dijkstra's 
algorithm can run on graphs with non-negative weights. On a grid, octilinear maze search 
expands in eight directions, whereas BFS expands in four directions. Dijkstra's 
algorithm expands in all directions of outgoing edges. In general, on a grid, 
Dijkstra's algorithm is expanded in the four cardinal directions. 



Algorithm               Input                             Output                                       Runtime 
Octilinear maze search  graph with equal weights          Shortest path based on octilinear distance   O(|V| + |E|) 
BFS                     graph with equal weights          Shortest path based on Manhattan distance    O(|V| + |E|) 
Dijkstra's algorithm    graph with non-negative weights   Shortest path based on edge weights          O(|V| log |E|) 


Exercise 3: Method of Means and Medians (MMM) 

(a) Clock tree constructed by MMM. 

x_c(s1-s8) = (2 + 4 + 2 + 4 + 8 + 8 + 14 + 14) / 8 = 56 / 8 = 7 
y_c(s1-s8) = (2 + 2 + 12 + 12 + 8 + 6 + 8 + 6) / 8 = 56 / 8 = 7 

Route s0 to u1 (7,7). After dividing by the median, s1-s4 are in one set, and s5-s8 are 
in another set. 

x_c(s1-s4) = (2 + 4 + 2 + 4) / 4 = 12 / 4 = 3 
y_c(s1-s4) = (2 + 2 + 12 + 12) / 4 = 28 / 4 = 7 

Route u1 to u2 (3,7). After dividing by the median, s1 and s2 are in one set, and s3 and s4 
are in another set. 

x_c(s5-s8) = (8 + 8 + 14 + 14) / 4 = 44 / 4 = 11 
y_c(s5-s8) = (8 + 6 + 8 + 6) / 4 = 28 / 4 = 7 

Route u1 to u3 (11,7). After dividing by the median, s5 and s7 are in one set, and s6 and s8 
are in another set. 

x_c(s1,s2) = (2 + 4) / 2 = 6 / 2 = 3        y_c(s1,s2) = (2 + 2) / 2 = 4 / 2 = 2 
x_c(s3,s4) = (2 + 4) / 2 = 6 / 2 = 3        y_c(s3,s4) = (12 + 12) / 2 = 24 / 2 = 12 

Route u4 (3,2) and u5 (3,12) to u2. 

x_c(s5,s7) = (8 + 14) / 2 = 22 / 2 = 11     y_c(s5,s7) = (8 + 8) / 2 = 16 / 2 = 8 
x_c(s6,s8) = (8 + 14) / 2 = 22 / 2 = 11     y_c(s6,s8) = (6 + 6) / 2 = 12 / 2 = 6 

Route u6 (11,8) and u7 (11,6) to u3. 

Route sinks s1 and s2 to u4. 
Route sinks s3 and s4 to u5. 
Route sinks s5 and s7 to u6. 
Route sinks s6 and s8 to u7. 

[Figure: clock tree constructed by MMM, shown after each level of partitioning.] 



(b) Topology of the clock tree generated by MMM. 

[Figure: topology of the clock tree generated by MMM.] 

(c) Total wirelength and skew of the clock tree T generated by MMM. 
L(T) = total number of grid edges = 42. 

t_LD(s0,s1) = t_LD(s0,s2) = t_LD(s0,s3) = t_LD(s0,s4) = t_LD(s0,s1-s4) = 16 
skew(T) = |t_LD(s0,s1-s4) - t_LD(s0,s5-s8)| = |16 - 14| = 2. 
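
The MMM recursion can be sketched in a few lines: route to the center of mass, split the sinks at the median (alternating between x and y), and recurse. In the sketch below the sink coordinates are read off the center-of-mass sums in the solution, while the source location s0 = (7,1) is an assumption chosen to be consistent with the stated wirelength and delays; wires are measured by Manhattan length and delay is linear in path length.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def mmm(sinks, parent, dist, split_x, edges, delays):
    """Method of Means and Medians with linear delay.

    sinks: list of (name, (x, y)); parent: point the subtree is routed to;
    dist: path length from the source to parent; split_x: split direction.
    """
    if len(sinks) == 1:
        name, pos = sinks[0]
        edges.append((parent, pos))
        delays[name] = dist + manhattan(parent, pos)
        return
    center = (sum(p[0] for _, p in sinks) / len(sinks),
              sum(p[1] for _, p in sinks) / len(sinks))
    edges.append((parent, center))
    dist += manhattan(parent, center)
    axis = 0 if split_x else 1
    ordered = sorted(sinks, key=lambda s: s[1][axis])   # median split
    half = len(ordered) // 2
    mmm(ordered[:half], center, dist, not split_x, edges, delays)
    mmm(ordered[half:], center, dist, not split_x, edges, delays)

sinks = [("s1", (2, 2)), ("s2", (4, 2)), ("s3", (2, 12)), ("s4", (4, 12)),
         ("s5", (8, 8)), ("s6", (8, 6)), ("s7", (14, 8)), ("s8", (14, 6))]
source = (7, 1)                                         # assumed location of s0
edges, delays = [], {}
mmm(sinks, source, 0, True, edges, delays)
wirelength = sum(manhattan(p, q) for p, q in edges)
skew = max(delays.values()) - min(delays.values())
print(wirelength, delays, skew)
```

Because MMM balances only the number of sinks and not the path lengths, the two halves of the tree end up with different delays, which is the source of the nonzero skew.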

Exercise 4: Recursive Geometric Matching (RGM) Algorithm

(a) Clock tree constructed by RGM.



Perform min-cost geometric matching over
sinks $s_1$-$s_8$.

Since $C(s_1) = C(s_2)$, $t_{LD}(T_{s_1}) = t_{LD}(T_{s_2})$. Therefore,
the tapping point $u_1$ is the midpoint of
the segment ($s_1$-$s_2$). Similar arguments can
be made for ($s_3$-$s_4$), ($s_5$-$s_6$) and ($s_7$-$s_8$).






Perform min-cost geometric matching over
internal nodes $u_1$-$u_4$.

Since $t_{LD}(T_{u_1}) = t_{LD}(T_{u_2})$, the tapping point $u_5$
is the midpoint of the segment ($u_1$-$u_2$). Similar
arguments can be made for ($u_3$-$u_4$).



Perform min-cost geometric matching over
internal nodes $u_5$ and $u_6$. For tapping point $u_7$
on ($u_5$-$u_6$),

$t_{LD}(u_5,u_7) = L(u_5,u_7)$, and $t_{LD}(u_7,u_6) = L(u_7,u_6)$.
Since $L(u_5,u_7) + L(u_7,u_6) = L(u_5,u_6) = 8$,

$t_{LD}(T_{u_5}) = L(u_5,s_1) + t(s_1) = L(u_5,s_2) + t(s_2) = L(u_5,s_3) + t(s_3) = L(u_5,s_4) + t(s_4) = 6$,

and zero skew at $u_7$ requires $L(u_5,u_7) + t_{LD}(T_{u_5}) = L(u_7,u_6) + t_{LD}(T_{u_6})$, where $t_{LD}(T_{u_6}) = 4$.

Combining all above equations, $L(u_5,u_7) = 3$ and $L(u_7,u_6) = 5$. Route $s_0$ to $u_7$.
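
The balancing step can be written as a small helper. This is a sketch under the linear delay model, where wire delay equals wire length; the function name `tapping_point` is hypothetical, and the subtree delay of 4 used for $T_{u_6}$ is inferred from the reported split (3 + 6 = 5 + 4) rather than quoted from the solution.

```python
def tapping_point(length, t_a, t_b):
    """Split a segment of `length` between subtrees with delays t_a and t_b.

    Under the linear delay model, zero skew needs d_a + t_a == d_b + t_b
    with d_a + d_b == length, which gives d_a = (length + t_b - t_a) / 2.
    """
    d_a = (length + t_b - t_a) / 2
    if not 0 <= d_a <= length:
        # A detour (wire snaking) would be needed; not handled in this sketch.
        raise ValueError("tapping point falls outside the segment")
    return d_a, length - d_a


print(tapping_point(8, 6, 4))  # segment (u5-u6)
print(tapping_point(2, 0, 0))  # equal subtree delays give the midpoint
```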
(b) Topology of the clock tree generated by RGM. 











(c) Total wirelength and skew of the clock tree T generated by RGM.
$L(T)$ = total number of grid edges = 39.

$t_{LD}(s_0,s_1) = t_{LD}(s_0,s_2) = \dots = t_{LD}(s_0,s_8) = 16$.

Since the delay from $s_0$ to all sinks is the same, $skew(T) = 0$.

Exercise 5: Deferred-Merge Embedding (DME) Algorithm

(a) Merging segments.



Find $ms(u_2)$ from sinks $s_1$ and
$s_3$, and $ms(u_3)$ from sinks $s_2$
and $s_4$.





Find $ms(u_1)$ from merging
segments $ms(u_2)$ and $ms(u_3)$.



(b) Embedded clock tree.

Connect $s_0$ to $u_1$. Based on
$trr(u_1)$, connect $ms(u_2)$ and
$ms(u_3)$ to $u_1$.







Based on $trr(u_2)$, route $u_2$ to
sinks $s_1$ and $s_3$. Based on
$trr(u_3)$, route $u_3$ to sinks $s_2$
and $s_4$.



Exercise 6: Exact Zero Skew

(a) Location of internal nodes $u_1$, $u_2$ and $u_3$.
Find the x- and y-coordinates of $u_2$, on the segment between $s_1$ and $s_3$.
$t_{ED}(s_1) = 0$, $t_{ED}(s_3) = 0$, $C(s_1) = 0.1$, $C(s_3) = 0.3$, $\alpha = 0.1$, $\beta = 0.01$
$L(s_1,s_3) = |x_{s_1} - x_{s_3}| + |y_{s_1} - y_{s_3}| = |2 - 4| + |9 - 1| = 10$

$$z_{s_1 \sim u_2} = \frac{(0 - 0) + 0.1 \cdot 10 \cdot \left(0.3 + \frac{0.01 \cdot 10}{2}\right)}{0.1 \cdot 10 \cdot (0.01 \cdot 10 + 0.1 + 0.3)} = \frac{0.35}{0.5} = 0.7$$

$z_{s_1 \sim u_2} \cdot L(s_1,s_3) = 0.7 \cdot 10 = 7$, x- and y-coordinates for $u_2$ = (4,4).

Find the capacitance $C(u_2)$.

$C(s_1) = 0.1$, $C(s_3) = 0.3$, $\beta = 0.01$, $L(s_1,s_3) = 10$

$C(u_2) = C(s_1) + C(s_3) + \beta \cdot L(s_1,s_3) = 0.1 + 0.3 + 0.01 \cdot 10 = 0.5$

Find the delay $t_{ED}(u_2)$.

$t_{ED}(s_1) = 0$, $t_{ED}(s_3) = 0$, $z_{s_1 \sim u_2} = 0.7$, $z_{u_2 \sim s_3} = 1 - z_{s_1 \sim u_2} = 1 - 0.7 = 0.3$

$R(s_1 \sim u_2) = \alpha \cdot z_{s_1 \sim u_2} \cdot L(s_1,s_3) = 0.1 \cdot 0.7 \cdot 10 = 0.7$
$C(s_1 \sim u_2) = \beta \cdot z_{s_1 \sim u_2} \cdot L(s_1,s_3) = 0.01 \cdot 0.7 \cdot 10 = 0.07$

$R(u_2 \sim s_3) = \alpha \cdot z_{u_2 \sim s_3} \cdot L(s_1,s_3) = 0.1 \cdot 0.3 \cdot 10 = 0.3$
$C(u_2 \sim s_3) = \beta \cdot z_{u_2 \sim s_3} \cdot L(s_1,s_3) = 0.01 \cdot 0.3 \cdot 10 = 0.03$

$$t_{ED}(u_2) = R(s_1 \sim u_2) \cdot \left(\frac{C(s_1 \sim u_2)}{2} + C(s_1)\right) + t_{ED}(s_1) = 0.7 \cdot \left(\frac{0.07}{2} + 0.1\right) + 0 = 0.0945$$
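
The three steps above (tapping-point ratio, merged capacitance, merged Elmore delay) can be collected into one helper. This is a minimal sketch: the function name `zero_skew_merge` and its argument order are hypothetical, and the case where the ratio falls outside [0, 1] (which would require wire detouring) is not handled.

```python
def zero_skew_merge(t_a, t_b, c_a, c_b, length, alpha=0.1, beta=0.01):
    """Merge subtrees a and b over a wire of the given length.

    Returns (z, c_merged, t_merged), where z is the fraction of the wire on
    a's side, c_merged is the capacitance seen at the tapping point, and
    t_merged is the Elmore delay from the tapping point to the sinks.
    alpha and beta are the unit wire resistance and capacitance.
    """
    numerator = (t_b - t_a) + alpha * length * (c_b + beta * length / 2)
    denominator = alpha * length * (beta * length + c_a + c_b)
    z = numerator / denominator
    r_wire = alpha * z * length   # resistance of the wire towards a
    c_wire = beta * z * length    # capacitance of the wire towards a
    t_merged = r_wire * (c_wire / 2 + c_a) + t_a
    c_merged = c_a + c_b + beta * length
    return z, c_merged, t_merged


# Merge s1 (C = 0.1) and s3 (C = 0.3) over L(s1, s3) = 10.
z, c_u2, t_u2 = zero_skew_merge(0, 0, 0.1, 0.3, 10)
print(round(z, 3), round(c_u2, 3), round(t_u2, 4))
```

Calling the same helper with `zero_skew_merge(0, 0, 0.2, 0.2, 8)` gives the values for the merge of $s_2$ and $s_4$.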

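Find the x- and y-coordinates of $u_3$, on the segment between $s_2$ and $s_4$.
$t_{ED}(s_2) = 0$, $t_{ED}(s_4) = 0$, $C(s_2) = 0.2$, $C(s_4) = 0.2$, $\alpha = 0.1$, $\beta = 0.01$, $L(s_2,s_4) = 8$

$$z_{s_2 \sim u_3} = \frac{(0 - 0) + 0.1 \cdot 8 \cdot \left(0.2 + \frac{0.01 \cdot 8}{2}\right)}{0.1 \cdot 8 \cdot (0.01 \cdot 8 + 0.2 + 0.2)} = \frac{0.192}{0.384} = 0.5$$

$z_{s_2 \sim u_3} \cdot L(s_2,s_4) = 0.5 \cdot 8 = 4$, x- and y-coordinates for $u_3$ = (10,7).

Find the capacitance $C(u_3)$.

$C(s_2) = 0.2$, $C(s_4) = 0.2$, $\beta = 0.01$, $L(s_2,s_4) = 8$

$C(u_3) = C(s_2) + C(s_4) + \beta \cdot L(s_2,s_4) = 0.2 + 0.2 + 0.01 \cdot 8 = 0.48$

Find the delay $t_{ED}(u_3)$.

$t_{ED}(s_2) = 0$, $t_{ED}(s_4) = 0$, $z_{s_2 \sim u_3} = 0.5$, $z_{u_3 \sim s_4} = 1 - z_{s_2 \sim u_3} = 1 - 0.5 = 0.5$

$R(s_2 \sim u_3) = \alpha \cdot z_{s_2 \sim u_3} \cdot L(s_2,s_4) = 0.1 \cdot 0.5 \cdot 8 = 0.4$
$C(s_2 \sim u_3) = \beta \cdot z_{s_2 \sim u_3} \cdot L(s_2,s_4) = 0.01 \cdot 0.5 \cdot 8 = 0.04$

$R(u_3 \sim s_4) = \alpha \cdot z_{u_3 \sim s_4} \cdot L(s_2,s_4) = 0.1 \cdot 0.5 \cdot 8 = 0.4$
$C(u_3 \sim s_4) = \beta \cdot z_{u_3 \sim s_4} \cdot L(s_2,s_4) = 0.01 \cdot 0.5 \cdot 8 = 0.04$

$$t_{ED}(u_3) = R(u_3 \sim s_4) \cdot \left(\frac{C(u_3 \sim s_4)}{2} + C(s_4)\right) + t_{ED}(s_4) = 0.4 \cdot \left(\frac{0.04}{2} + 0.2\right) + 0 = 0.088$$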

Find the x- and y-coordinates of $u_1$, on the segment between $u_2$ and $u_3$.
$t_{ED}(u_2) = 0.0945$, $t_{ED}(u_3) = 0.088$, $C(u_2) = 0.5$, $C(u_3) = 0.48$, $\alpha = 0.1$, $\beta = 0.01$
$L(u_2,u_3) = |x_{u_2} - x_{u_3}| + |y_{u_2} - y_{u_3}| = |4 - 10| + |4 - 7| = 9$

$$z_{u_2 \sim u_1} = \frac{(t_{ED}(u_3) - t_{ED}(u_2)) + \alpha \cdot L(u_2,u_3) \cdot \left(C(u_3) + \frac{\beta \cdot L(u_2,u_3)}{2}\right)}{\alpha \cdot L(u_2,u_3) \cdot (\beta \cdot L(u_2,u_3) + C(u_2) + C(u_3))}$$

$$= \frac{(0.088 - 0.0945) + 0.1 \cdot 9 \cdot \left(0.48 + \frac{0.01 \cdot 9}{2}\right)}{0.1 \cdot 9 \cdot (0.01 \cdot 9 + 0.5 + 0.48)} = \frac{0.466}{0.963} \approx 0.484$$

$z_{u_2 \sim u_1} \cdot L(u_2,u_3) = 0.484 \cdot 9 = 4.36$, x- and y-coordinates for $u_1$ = (8.36,4).



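(b) Elmore delay from $u_1$ to every sink $s_1$-$s_4$.

$t_{ED}(u_2) = 0.0945$, $t_{ED}(u_3) = 0.088$, $C(u_2) = 0.5$, $C(u_3) = 0.48$,

$z_{u_2 \sim u_1} = z_{u_1 \sim u_2} = 0.484$, $z_{u_1 \sim u_3} = 1 - z_{u_1 \sim u_2} = 0.516$

$R(u_1 \sim u_2) = \alpha \cdot z_{u_1 \sim u_2} \cdot L(u_2,u_3) = 0.1 \cdot 0.484 \cdot 9 = 0.4356$
$C(u_1 \sim u_2) = \beta \cdot z_{u_1 \sim u_2} \cdot L(u_2,u_3) = 0.01 \cdot 0.484 \cdot 9 = 0.04356$

$R(u_1 \sim u_3) = \alpha \cdot z_{u_1 \sim u_3} \cdot L(u_2,u_3) = 0.1 \cdot 0.516 \cdot 9 = 0.4644$
$C(u_1 \sim u_3) = \beta \cdot z_{u_1 \sim u_3} \cdot L(u_2,u_3) = 0.01 \cdot 0.516 \cdot 9 = 0.04644$

$$t_{ED}(u_1,s_1\text{-}s_4) = t_{ED}(u_1,s_1) = t_{ED}(u_1,s_3) = R(u_1 \sim u_2) \cdot \left(\frac{C(u_1 \sim u_2)}{2} + C(u_2)\right) + t_{ED}(u_2) = 0.4356 \cdot \left(\frac{0.04356}{2} + 0.5\right) + 0.0945$$

$$= t_{ED}(u_1,s_2) = t_{ED}(u_1,s_4) = R(u_1 \sim u_3) \cdot \left(\frac{C(u_1 \sim u_3)}{2} + C(u_3)\right) + t_{ED}(u_3) = 0.4644 \cdot \left(\frac{0.04644}{2} + 0.48\right) + 0.088 \approx 0.322$$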
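
The two branch delays can be checked numerically. This is a small sketch that takes the rounded ratios 0.484 and 0.516 as given in the solution; the helper name `branch_delay` is hypothetical. Because the ratios are rounded, the two branches agree only to about three decimal places.

```python
ALPHA, BETA = 0.1, 0.01  # unit wire resistance and capacitance from the exercise


def branch_delay(z, length, c_down, t_down):
    """Elmore delay through a wire of length z * length into a subtree.

    c_down and t_down are the capacitance and delay of the subtree root.
    """
    r_wire = ALPHA * z * length
    c_wire = BETA * z * length
    return r_wire * (c_wire / 2 + c_down) + t_down


t_via_u2 = branch_delay(0.484, 9, c_down=0.5, t_down=0.0945)
t_via_u3 = branch_delay(0.516, 9, c_down=0.48, t_down=0.088)
print(round(t_via_u2, 3), round(t_via_u3, 3))
```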

Exercise 7: Bounded-Skew DME 

Feasible regions are marked as follows.



Chapter 8: Timing Closure



Exercise 1: Static Timing Analysis 

(a) Timing graph. 



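[Figure: timing graph with node delays in parentheses after each node name and edge delays in parentheses on the edges, including a(0) to y(2) with edge delay 0.75, w(2) to f(0) with edge delay 0.25, and c(0) to z(2) with edge delay 0.3.]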



(b) Actual arrival times. 



[Figure: timing graph annotated with actual arrival times A.]
A(a) = 0, A(b) = 0.15, A(c) = 0.3, A(x) = 1.3, A(y) = 3.4, A(z) = 2.6, A(w) = 5.5, A(f) = 5.75

(c) Required arrival times.

[Figure: timing graph annotated with required arrival times R.]
R(a) = -0.1, R(b) = -0.6, R(c) = 0.05, R(x) = 0.55, R(y) = 2.65, R(z) = 2.35, R(w) = 4.75, R(f) = 5

(d) Slacks.

[Figure: timing graph annotated with slacks S.]
S(b) = S(x) = S(y) = S(w) = S(f) = -0.75, S(c) = S(z) = -0.25


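The slack at each node is the required arrival time minus the actual arrival time. The sketch below recomputes the slacks from the values in (b) and (c); the numbers are read from partly legible figure annotations, so they should be treated as assumptions, and node a is left out because its slack annotation cannot be read.

```python
# AAT and RAT values as read from the annotated timing graphs in (b) and (c);
# the figures are only partly legible, so these numbers are assumptions.
AAT = {"b": 0.15, "c": 0.3, "x": 1.3, "y": 3.4, "z": 2.6, "w": 5.5, "f": 5.75}
RAT = {"b": -0.6, "c": 0.05, "x": 0.55, "y": 2.65, "z": 2.35, "w": 4.75, "f": 5.0}


def slacks(aat, rat):
    """slack(v) = RAT(v) - AAT(v); a negative slack is a timing violation."""
    return {v: round(rat[v] - aat[v], 2) for v in aat}


slack = slacks(AAT, RAT)
worst = min(slack.values())
critical = sorted(v for v, s in slack.items() if s == worst)
print(slack)
print(critical)
```

The nodes with the worst slack are the ones a timing optimizer would address first.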

Exercise 2: Timing-Driven Routing 

(a) Minimum spanning tree with radius(T) = 13 and cost(T) = 30.

(b) Shortest-paths tree with radius(T) = 8 and cost(T) = 39.

(c) PD-tradeoff spanning tree (γ = 0.5) with radius(T) = 9 and cost(T) = 35.

Exercise 3: Buffer Insertion for Timing Improvement

With Fig. 8.12, the delay of each gate can be calculated with its load capacitance.
Buffer y always has a load capacitance of 2.5 fF.

Buffer y with size A: $v_B$ has load capacitance = 2.5 fF, which results in $t(v_B)$ = 30 ps.
$AAT(c) = t(v_B) + t(y_A) = 30 + 35 = 65$ ps.

Buffer y with size B: $v_B$ has load capacitance = 3 fF, which results in $t(v_B)$ = 33 ps.
$AAT(c) = t(v_B) + t(y_B) = 33 + 30 = 63$ ps.

Buffer y with size C: $v_B$ has load capacitance = 4 fF, which results in $t(v_B)$ = 39 ps.
$AAT(c) = t(v_B) + t(y_C) = 39 + 27 = 66$ ps.

The best size for buffer y is B.
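
The selection amounts to minimizing the arrival time at c over the three candidate sizes. A minimal sketch, using the delays quoted in this solution; the dictionary layout and the name `aat_at_c` are illustrative assumptions.

```python
# Delays in ps from the worked solution: t(vB) for the load presented by each
# size of buffer y, and the delay of buffer y itself at that size.
CANDIDATES = {
    "A": {"t_vB": 30, "t_y": 35},
    "B": {"t_vB": 33, "t_y": 30},
    "C": {"t_vB": 39, "t_y": 27},
}


def aat_at_c(option):
    """Arrival time at c: delay of gate vB plus delay of buffer y."""
    return option["t_vB"] + option["t_y"]


aat = {size: aat_at_c(option) for size, option in CANDIDATES.items()}
best = min(aat, key=aat.get)
print(aat, best)
```

A larger buffer is faster itself but loads the driving gate more, which is why the middle size gives the smallest sum here.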

Exercise 4: Timing Optimization 

1. Delay budgeting: assigning upper bounds on timing or length for nets. These 
limits restrict the maximum amount of time a signal travels along critical nets. 
However, if too many nets are constrained, this can lead to wirelength degrada- 
tion or highly-congested regions. 

2. Physical synthesis, such as gate sizing and cloning. Sizing up gates can improve 
the delay on specific paths at the cost of increased area and power. Cloning can 
mitigate long interconnect delays by duplicating gates or signals at locations 
closer to the desired location. 

Exercise 5: Cloning vs. Buffering 

Cloning is more advantageous than buffering when the same timing-critical signal is 
needed in multiple locations that are relatively far apart. The signal can just be re- 
produced locally. This can save on area and routing resources. Buffering can be 
more advantageous than cloning since buffers do not increase the upstream capaci- 
tance of the gate, which is helpful in terms of circuit delay and power. 

Exercise 6: Physical Synthesis

Buffer removal (vs. buffer insertion): If the placement or routing of the buffered net 
changes, some buffers may no longer be necessary to meet timing constraints. Alter- 
natively, buffers can be removed if the net is not timing-critical or has positive slack. 

Gate downsizing (vs. gate upsizing): If the path that goes through the gate can be 
slowed down without slack violations, then the gate can be downsized. 

Merging (vs. cloning): If the netlist, placement or routing of the design changes, 
some nodes could be removed due to redundancy. 

For all three forward transforms (buffer insertion, gate upsizing and cloning),
increasing the area can cause illegality (overlap) in the placement. Therefore,
the reverse transforms can be necessary to meet area constraints or relax timing
for non-critical paths.